Abstract

We derive generalization bounds for learning algorithms based on their robustness: the
property that if a testing sample is "similar" to a training sample, then the testing error is
close to the training error. This provides a novel approach, different from the complexity
or stability arguments, to study generalization of learning algorithms. We further show
that a weak notion of robustness is both sufficient and necessary for generalizability, which
implies that robustness is a fundamental property for learning algorithms to work.



1. Introduction 

The key issue in the task of learning from a set of observed samples is the estimation of 
the risk (i.e., generalization error) of learning algorithms. Typically, its empirical mea- 
surement (i.e., training error) provides an optimistically biased estimation, especially when 
the number of training samples is small. Several approaches have been proposed to bound 
the deviation of the risk from its empirical measurement, among which methods based on 
uniform convergence and stability are most widely used. 



Uniform convergence of empirical quantities to their mean (e.g., Vapnik and Chervonenkis,
1974, 1991) provides ways to bound the gap between the expected risk and the empirical risk
by the complexity of the hypothesis set. Examples of complexity measures are the
Vapnik-Chervonenkis (VC) dimension (e.g., Vapnik and Chervonenkis, 1991; Evgeniou et al., 2000),
the fat-shattering dimension (e.g., Alon et al., 1997; Bartlett, 1998), and the Rademacher
complexity (Bartlett and Mendelson, 2002; Bartlett et al., 2005). Another well-known approach
is based on stability. An algorithm is stable if its output remains "similar" for different
sets of training samples that are identical up to removal or change of a single sample.
The first results that relate stability to generalizability trace back to Devroye and Wagner
(1979a) and Devroye and Wagner (1979b). Later, McDiarmid's (McDiarmid, 1989) concentration
inequalities facilitated new bounds on generalization error (e.g., Bousquet and Elisseeff,
2002; Poggio et al., 2004; Mukherjee et al., 2006).

In this paper we explore a different approach which we term algorithmic robustness.
Briefly speaking, an algorithm is robust if its solution has the following property: it
achieves "similar" performance on a testing sample and a training sample that are "close".
This notion of robustness is rooted in robust optimization (Ben-Tal and Nemirovski, 1998;
Ben-Tal and Nemirovski, 1999; Bertsimas and Sim, 2004) where a decision maker aims to
find a solution $x$ that minimizes a (parameterized) cost function $f(x, \xi)$ with the
knowledge that the unknown true parameter $\xi$ may deviate from the observed parameter $\hat{\xi}$.
Hence, instead of solving $\min_x f(x, \hat{\xi})$ one solves $\min_x [\max_{\tilde{\xi} \in \Delta} f(x, \tilde{\xi})]$, where $\Delta$ includes
all possible realizations of $\tilde{\xi}$. Robust optimization was introduced in machine learning
tasks to handle exogenous noise (e.g., Bhattacharyya et al., 2004; Shivaswamy et al., 2006;
Globerson and Roweis, 2006), i.e., the learning algorithm only has access to inaccurate
observations of training samples. Later on, Xu et al. (2009b,a) showed that both the Support
Vector Machine (SVM) and Lasso have a robust optimization interpretation, i.e., they can be
reformulated as

$$\min_{h \in \mathcal{H}} \; \max_{(\delta_1, \cdots, \delta_n) \in \Delta} \; \sum_{i=1}^{n} l(h, z_i + \delta_i),$$

for some $\Delta$. Here $z_i$ are the observed training samples and $l(\cdot, \cdot)$ is the loss function (hinge
loss for SVM, and squared loss for Lasso), which means that SVM and Lasso essentially
minimize the empirical error under the worst possible perturbation. Indeed, as the authors
of Xu et al. (2009b,a) showed, this reformulation leads to requiring that the loss of a sample
"close" to $z_i$ is small, which further implies statistical consistency of these two algorithms.
In this paper we adopt this approach and study the (finite sample) generalization ability of
learning algorithms by investigating the loss of learned hypotheses on samples that slightly
deviate from training samples.

Of special interest is that robustness is more than just another way to establish
generalization bounds. Indeed, we show that a weaker notion of robustness is a necessary and
sufficient condition of (asymptotic) generalizability of (general) learning algorithms. While
it is known that having a finite VC-dimension (Vapnik and Chervonenkis, 1991) or equivalently
being CVEEE$_{loo}$ stable (Mukherjee et al., 2006) is necessary and sufficient for Empirical
Risk Minimization (ERM) to generalize, much less is known in the general case. Recently,
Shalev-Shwartz et al. (2009) proposed a weaker notion of stability that is necessary and
sufficient for a learning algorithm to be consistent and generalizing, provided that the
problem itself is learnable. However, learnability requires that the convergence rate is uniform
with respect to all distributions, and is hence a fairly strong assumption. In particular,
the standard supervised learning setup where the hypothesis set is the set of measurable
functions is not learnable since no algorithm can achieve a uniform convergence rate (cf.
Devroye et al., 1996). Indeed, as the authors of Shalev-Shwartz et al. (2009) stated, for
supervised learning problems learnability is equivalent to the generalizability of ERM, and
hence reduces to the aforementioned results on ERM algorithms.
In particular, our main contributions are the following: 



1. We propose a notion of algorithmic robustness. Algorithmic robustness is a desired 
property for a learning algorithm since it implies a lack of sensitivity to (small) dis- 
turbances in the training data. 

2. Based on the notion of algorithmic robustness, we derive generalization bounds for IID
samples as well as samples drawn according to a Markov chain.

3. To illustrate the applicability of the notion of algorithmic robustness, we provide some 
examples of robust algorithms, including SVM, Lasso, feed-forward neural networks 
and PCA. 

4. We propose a weaker notion of robustness and show that it is both necessary and 
sufficient for a learning algorithm to generalize. This implies that robustness is an 
essential property needed for a learning algorithm to work. 

Note that while stability and robustness are similar on an intuitive level, there is a 
difference between the two: stability requires that nearly identical training sets with a 
single sample removed lead to similar prediction rules, whereas robustness requires that a 
prediction rule has comparable performance if tested on a sample close to a training sample. 

This paper is organized as follows. We define the notion of robustness in Section 2 and
prove generalization bounds for robust algorithms in Section 3. In Section 4 we propose a
relaxed notion of robustness, which is termed pseudo-robustness, and show corresponding
generalization bounds. Examples of learning algorithms that are robust or pseudo-robust
are provided in Section 5. Finally, we show that robustness is necessary and sufficient for
generalizability in Section 6.

1.1 Preliminaries 

We consider the following general learning model: a set of training samples is given, and
the goal is to pick a hypothesis from a hypothesis set. Unless otherwise mentioned,
throughout this paper the size of the training set is fixed as $n$. Therefore, we drop the dependence of
parameters on the number of training samples, while it should be understood that
parameters may vary with the number of training samples. We use $\mathcal{Z}$ and $\mathcal{H}$ to denote the set
from which each sample is drawn, and the hypothesis set, respectively. Throughout the
paper we use $\mathbf{s}$ to denote the training sample set consisting of $n$ training samples $(s_1, \cdots, s_n)$.
A learning algorithm $\mathcal{A}$ is thus a mapping from $\mathcal{Z}^n$ to $\mathcal{H}$. We use $\mathcal{A}_{\mathbf{s}}$ to represent the
hypothesis learned (given training set $\mathbf{s}$). For each hypothesis $h \in \mathcal{H}$ and a point $z \in \mathcal{Z}$,
there is an associated loss $l(h, z)$. We ignore the issue of measurability and further assume
that $l(h, z)$ is non-negative and upper-bounded uniformly by a scalar $M$.

In the special case of supervised learning, the sample space can be decomposed as
$\mathcal{Z} = \mathcal{Y} \times \mathcal{X}$, and the goal is to learn a mapping from $\mathcal{X}$ to $\mathcal{Y}$, i.e., to predict the
y-component given the x-component. We hence use $\mathcal{A}_{\mathbf{s}}(x)$ to represent the prediction of $x \in \mathcal{X}$
if trained on $\mathbf{s}$. We call $\mathcal{X}$ the input space and $\mathcal{Y}$ the output space. The output space can
either be $\mathcal{Y} = \{-1, +1\}$ for a classification problem, or $\mathcal{Y} = \mathbb{R}$ for a regression problem.
We use the subscripts $_{|x}$ and $_{|y}$ to denote the x-component and y-component of a point. For example, $s_{i|x}$
is the x-component of $s_i$. To simplify notation, for a scalar $c$, we use $[c]^+$ to represent its
non-negative part, i.e., $[c]^+ = \max(0, c)$.



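We recall the following standard notion of covering number from van der Vaart and Wellner
(2000).

Definition 1 (cf. van der Vaart and Wellner (2000)) For a metric space $(S, \rho)$ and $T \subset S$
we say that $\hat{T} \subset S$ is an $\epsilon$-cover of $T$, if $\forall t \in T$, $\exists \hat{t} \in \hat{T}$ such that $\rho(t, \hat{t}) \le \epsilon$. The
$\epsilon$-covering number of $T$ is

$$\mathcal{N}(\epsilon, T, \rho) = \min\{|\hat{T}| : \hat{T} \text{ is an } \epsilon\text{-cover of } T\}.$$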
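
To make Definition 1 concrete, the following minimal sketch builds an $\epsilon$-cover of a finite point set greedily. It is an illustration only: the function names `greedy_cover` and `euclidean` are hypothetical, and the greedy cover is generally not minimal, so its size is only an upper bound on the $\epsilon$-covering number of the given points.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]


def euclidean(a: Point, b: Point) -> float:
    """Euclidean metric, playing the role of rho."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def greedy_cover(points: Sequence[Point], eps: float,
                 rho: Callable[[Point, Point], float] = euclidean) -> List[Point]:
    """Return centers such that every point lies within eps of some center.

    A point becomes a new center only if no existing center covers it,
    so the result is an eps-cover of `points` (not necessarily a smallest one).
    """
    centers: List[Point] = []
    for p in points:
        if all(rho(p, c) > eps for c in centers):
            centers.append(p)
    return centers


# Illustrative data: an 11 x 11 grid on the unit square.
grid = [(i / 10, j / 10) for i in range(11) for j in range(11)]
cover = greedy_cover(grid, eps=0.25)
```

The length of `cover` upper-bounds the covering number of the grid at scale 0.25; a finer scale can only require more centers.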

2. Robustness of Learning Algorithms 

Before providing a precise definition of what we mean by "robustness" of an algorithm, 
we provide some motivating examples which share a common property: if a testing sample 
is close to a training sample, then the testing error is also close, a property we will later 
formalize as "robustness". 

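We first consider large-margin classifiers: Let the loss function be $l(\mathcal{A}_{\mathbf{s}}, z) = \mathbf{1}(\mathcal{A}_{\mathbf{s}}(z_{|x}) \neq z_{|y})$.
Fix $\gamma > 0$. An algorithm $\mathcal{A}_{\mathbf{s}}$ has a margin $\gamma$ if for $j = 1, \cdots, n$

$$\mathcal{A}_{\mathbf{s}}(x) = \mathcal{A}_{\mathbf{s}}(s_{j|x}); \quad \forall x : \|x - s_{j|x}\|_2 < \gamma.$$

That is, any training sample is at least $\gamma$ away from the classification boundary.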

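Example 1 Fix $\gamma > 0$ and put $K = 2\mathcal{N}(\gamma/2, \mathcal{X}, \|\cdot\|_2)$. If $\mathcal{A}_{\mathbf{s}}$ has a margin $\gamma$, then $\mathcal{Z}$ can
be partitioned into $K$ disjoint sets, denoted by $\{C_i\}_{i=1}^{K}$, such that if $s_j$ and $z \in \mathcal{Z}$ belong to
a same $C_i$, then $|l(\mathcal{A}_{\mathbf{s}}, s_j) - l(\mathcal{A}_{\mathbf{s}}, z)| = 0$.

Proof By definition of covering number, we can partition $\mathcal{X}$ into $\mathcal{N}(\gamma/2, \mathcal{X}, \|\cdot\|_2)$
subsets (denoted $\hat{X}_i$) such that each subset has a diameter less than or equal to $\gamma$. Further, $\mathcal{Y}$
can be partitioned into $\{-1\}$ and $\{+1\}$. Thus, we can partition $\mathcal{Z}$ into $2\mathcal{N}(\gamma/2, \mathcal{X}, \|\cdot\|_2)$
subsets such that if $z_1, z_2$ belong to a same subset, then $z_{1|y} = z_{2|y}$ and $\|z_{1|x} - z_{2|x}\| \le \gamma$.
By definition of margin, this guarantees that if $s_j$ and $z \in \mathcal{Z}$ belong to a same $C_i$, then
$|l(\mathcal{A}_{\mathbf{s}}, s_j) - l(\mathcal{A}_{\mathbf{s}}, z)| = 0$. ∎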

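The next example is a linear regression algorithm. Let the loss function be $l(\mathcal{A}_{\mathbf{s}}, z) =
|z_{|y} - \mathcal{A}_{\mathbf{s}}(z_{|x})|$, and let $\mathcal{X}$ be a bounded subset of $\mathbb{R}^m$ and fix $c > 0$. The norm-constrained
linear regression algorithm is

$$\mathcal{A}_{\mathbf{s}} = \arg\min_{w \in \mathbb{R}^m : \|w\|_2 \le c} \; \sum_{i=1}^{n} |s_{i|y} - w^\top s_{i|x}|, \qquad (1)$$

i.e., minimizing the empirical error among all linear predictors whose norm is bounded.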



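Example 2 Fix $\epsilon > 0$ and put $K = \mathcal{N}(\epsilon/2, \mathcal{X}, \|\cdot\|_2) \times \mathcal{N}(\epsilon/2, \mathcal{Y}, |\cdot|)$. Consider the algorithm
as in (1). The set $\mathcal{Z}$ can be partitioned into $K$ disjoint sets, such that if $s_j$ and $z \in \mathcal{Z}$ belong
to a same $C_i$, then

$$|l(\mathcal{A}_{\mathbf{s}}, s_j) - l(\mathcal{A}_{\mathbf{s}}, z)| \le (c + 1)\epsilon.$$

Proof Similarly to the previous example, we can partition $\mathcal{Z}$ into $\mathcal{N}(\epsilon/2, \mathcal{X}, \|\cdot\|_2) \times \mathcal{N}(\epsilon/2, \mathcal{Y}, |\cdot|)$
subsets, such that if $z_1, z_2$ belong to a same $C_i$, then $\|z_{1|x} - z_{2|x}\|_2 \le \epsilon$ and $|z_{1|y} - z_{2|y}| \le \epsilon$.
Since $\|w\|_2 \le c$, we have

$$\begin{aligned}
|l(w, z_1) - l(w, z_2)| &= \big|\, |z_{1|y} - w^\top z_{1|x}| - |z_{2|y} - w^\top z_{2|x}| \,\big| \\
&\le \big| (z_{1|y} - w^\top z_{1|x}) - (z_{2|y} - w^\top z_{2|x}) \big| \\
&\le |z_{1|y} - z_{2|y}| + \|w\|_2 \|z_{1|x} - z_{2|x}\|_2 \\
&\le (1 + c)\epsilon,
\end{aligned}$$

whenever $z_1, z_2$ belong to a same $C_i$. ∎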

The two motivating examples share a property: we can partition the sample set
into finitely many subsets, such that if a new sample falls into the same subset as a training sample,
then the loss of the former is close to the loss of the latter. We call an algorithm having
this property "robust."

Definition 2 Algorithm $\mathcal{A}$ is $(K, \epsilon(\mathbf{s}))$ robust if $\mathcal{Z}$ can be partitioned into $K$ disjoint sets,
denoted as $\{C_i\}_{i=1}^{K}$, such that $\forall s \in \mathbf{s}$,

$$s, z \in C_i \;\Longrightarrow\; |l(\mathcal{A}_{\mathbf{s}}, s) - l(\mathcal{A}_{\mathbf{s}}, z)| \le \epsilon(\mathbf{s}). \qquad (2)$$

In the definition, both $K$ and the partition sets $\{C_i\}_{i=1}^{K}$ do not depend on the training set
$\mathbf{s}$. Note that the definition of robustness requires that (2) holds for every training sample.
Indeed, we can relax the definition, so that the condition needs only hold for a subset of
training samples. We call an algorithm having this property "pseudo robust". See Section 4
for details.
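
As an illustration of the robustness condition (2), the sketch below estimates the smallest admissible $\epsilon$ for a fixed partition by comparing each training sample with probe points in the same cell. Everything here is an assumption made for the example: the threshold rule, the binning of $[0, 1]$, and the names `zero_one_loss`, `make_cell`, and `empirical_epsilon`. Because only finitely many probes are checked, the result is a lower bound on the true $\epsilon$.

```python
from typing import Callable, Hashable, Sequence, Tuple

Sample = Tuple[float, int]  # (x, y) with x in [0, 1] and y in {-1, +1}


def zero_one_loss(z: Sample, threshold: float = 0.5) -> float:
    """0-1 loss of the hypothetical rule 'predict +1 iff x >= threshold'."""
    x, y = z
    prediction = 1 if x >= threshold else -1
    return 0.0 if prediction == y else 1.0


def make_cell(num_bins: int) -> Callable[[Sample], Hashable]:
    """Partition Z = [0, 1] x {-1, +1} into K = 2 * num_bins cells."""
    def cell(z: Sample) -> Hashable:
        x, y = z
        return (min(int(x * num_bins), num_bins - 1), y)
    return cell


def empirical_epsilon(loss: Callable[[Sample], float],
                      cell: Callable[[Sample], Hashable],
                      train: Sequence[Sample],
                      probes: Sequence[Sample]) -> float:
    """Largest |l(s) - l(z)| over training s and probe z sharing a cell."""
    eps = 0.0
    for s in train:
        for z in probes:
            if cell(s) == cell(z):
                eps = max(eps, abs(loss(s) - loss(z)))
    return eps


train = [(0.1, -1), (0.4, -1), (0.6, 1), (0.9, 1)]
probes = [(i / 20, y) for i in range(21) for y in (-1, 1)]

# Four bins: the decision boundary 0.5 coincides with a bin edge.
eps_aligned = empirical_epsilon(zero_one_loss, make_cell(4), train, probes)
# Three bins: the boundary falls strictly inside the middle bin.
eps_misaligned = empirical_epsilon(zero_one_loss, make_cell(3), train, probes)
```

With the aligned partition the loss is constant on every cell, so the estimate is zero; with the misaligned one a training sample and a probe in the same cell land on opposite sides of the boundary, which forces $\epsilon = 1$ for the 0-1 loss.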

3. Generalization of Robust Algorithms 

In this section we investigate generalization property of robust algorithms. In particular, in 
the following subsections we derive PAC bounds for robust algorithms under three different 
conditions: (1) The ubiquitous learning setup where the samples are i.i.d. and the goal of 
learning is to minimize expected loss. (2) The learning goal is to minimize quantile loss. 
(3) The samples are generated according to a (Doeblin) Markovian chain. Indeed, the fact 
that we can provide results in (2) and (3) indicates the fundamental nature of robustness 
as a property of learning algorithms. 



3.1 IID samples and expected loss 

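In this section, we consider the standard learning setup, i.e., the sample set $\mathbf{s}$ consists of
$n$ i.i.d. samples generated by an unknown distribution $\mu$, and the goal of learning is to
minimize expected test loss. Let $\hat{l}(\cdot)$ and $l_{\mathrm{emp}}(\cdot)$ denote the expected error and the training
error, i.e.,

$$\hat{l}(\mathcal{A}_{\mathbf{s}}) \triangleq \mathbb{E}_{z \sim \mu}\, l(\mathcal{A}_{\mathbf{s}}, z); \qquad l_{\mathrm{emp}}(\mathcal{A}_{\mathbf{s}}) \triangleq \frac{1}{n} \sum_{s_i \in \mathbf{s}} l(\mathcal{A}_{\mathbf{s}}, s_i).$$

Recall that the loss function $l(\cdot, \cdot)$ is upper bounded by $M$.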

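Theorem 3 If $\mathbf{s}$ consists of $n$ i.i.d. samples, and $\mathcal{A}$ is $(K, \epsilon(\mathbf{s}))$-robust, then for any $\delta > 0$,
with probability at least $1 - \delta$,

$$\big| \hat{l}(\mathcal{A}_{\mathbf{s}}) - l_{\mathrm{emp}}(\mathcal{A}_{\mathbf{s}}) \big| \le \epsilon(\mathbf{s}) + M \sqrt{\frac{2K \ln 2 + 2 \ln(1/\delta)}{n}}.$$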

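Proof Let $N_i$ be the set of indices of points of $\mathbf{s}$ that fall into $C_i$. Note that $(|N_1|, \cdots, |N_K|)$
is an IID multinomial random variable with parameters $n$ and $(\mu(C_1), \cdots, \mu(C_K))$. The
following holds by the Bretagnolle-Huber-Carol inequality (cf. Proposition A6.6 of van der Vaart and Wellner,
2000):

$$\Pr\left\{ \sum_{i=1}^{K} \left| \frac{|N_i|}{n} - \mu(C_i) \right| \ge \lambda \right\} \le 2^K \exp\left( \frac{-n\lambda^2}{2} \right).$$

Hence, the following holds with probability at least $1 - \delta$,

$$\sum_{i=1}^{K} \left| \frac{|N_i|}{n} - \mu(C_i) \right| \le \sqrt{\frac{2K \ln 2 + 2 \ln(1/\delta)}{n}}. \qquad (3)$$

We have

$$\begin{aligned}
\big| \hat{l}(\mathcal{A}_{\mathbf{s}}) - l_{\mathrm{emp}}(\mathcal{A}_{\mathbf{s}}) \big|
&= \left| \sum_{i=1}^{K} \mathbb{E}\big(l(\mathcal{A}_{\mathbf{s}}, z) \mid z \in C_i\big)\, \mu(C_i) - \frac{1}{n} \sum_{i=1}^{n} l(\mathcal{A}_{\mathbf{s}}, s_i) \right| \\
&\overset{(a)}{\le} \left| \sum_{i=1}^{K} \mathbb{E}\big(l(\mathcal{A}_{\mathbf{s}}, z) \mid z \in C_i\big) \frac{|N_i|}{n} - \frac{1}{n} \sum_{i=1}^{n} l(\mathcal{A}_{\mathbf{s}}, s_i) \right| \\
&\qquad + \left| \sum_{i=1}^{K} \mathbb{E}\big(l(\mathcal{A}_{\mathbf{s}}, z) \mid z \in C_i\big)\, \mu(C_i) - \sum_{i=1}^{K} \mathbb{E}\big(l(\mathcal{A}_{\mathbf{s}}, z) \mid z \in C_i\big) \frac{|N_i|}{n} \right| \\
&\overset{(b)}{\le} \left| \frac{1}{n} \sum_{i=1}^{K} \sum_{j \in N_i} \max_{z_2 \in C_i} \big| l(\mathcal{A}_{\mathbf{s}}, s_j) - l(\mathcal{A}_{\mathbf{s}}, z_2) \big| \right|
 + \left| \max_{z \in \mathcal{Z}} |l(\mathcal{A}_{\mathbf{s}}, z)| \sum_{i=1}^{K} \left| \frac{|N_i|}{n} - \mu(C_i) \right| \right| \qquad (4) \\
&\overset{(c)}{\le} \epsilon(\mathbf{s}) + M \sum_{i=1}^{K} \left| \frac{|N_i|}{n} - \mu(C_i) \right|,
\end{aligned}$$

where (a), (b), and (c) are due to the triangle inequality, the definition of $N_i$, and the
definition of $\epsilon(\mathbf{s})$ and $M$, respectively. Note that the right-hand side of (4) is upper-bounded
by $\epsilon(\mathbf{s}) + M \sqrt{\frac{2K \ln 2 + 2 \ln(1/\delta)}{n}}$ with probability at least $1 - \delta$ due to (3). The theorem follows. ∎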
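
To get a feel for the scale of the bound in Theorem 3, the right-hand side $\epsilon(\mathbf{s}) + M\sqrt{(2K\ln 2 + 2\ln(1/\delta))/n}$ can be evaluated numerically. The helper below is a minimal sketch; the function name and the parameter values are illustrative assumptions, not taken from the paper.

```python
import math


def robust_generalization_bound(eps: float, M: float, K: int, n: int, delta: float) -> float:
    """Right-hand side of Theorem 3: eps + M * sqrt((2K ln 2 + 2 ln(1/delta)) / n)."""
    if not (0.0 < delta < 1.0) or K < 1 or n < 1:
        raise ValueError("need 0 < delta < 1, K >= 1 and n >= 1")
    return eps + M * math.sqrt((2 * K * math.log(2) + 2 * math.log(1 / delta)) / n)


# Illustrative values: a (10, 0.1)-robust algorithm with loss bounded by 1.
bound_small_n = robust_generalization_bound(eps=0.1, M=1.0, K=10, n=1_000, delta=0.05)
bound_large_n = robust_generalization_bound(eps=0.1, M=1.0, K=10, n=10_000, delta=0.05)
```

The deviation term shrinks at rate $1/\sqrt{n}$ while $\epsilon(\mathbf{s})$ does not, so for large $n$ the bound is dominated by the robustness level of the chosen partition.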



Theorem 3 requires that we fix $K$ a priori. However, it is often worthwhile to consider
an adaptive $K$. For example, in the large-margin classification case, typically the margin is
known only after $\mathbf{s}$ is realized. That is, the value of $K$ depends on $\mathbf{s}$. Because of this
dependency, we need a generalization bound that holds uniformly for all $K$.

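Corollary 4 If $\mathbf{s}$ consists of $n$ i.i.d. samples, and $\mathcal{A}$ is $(K, \epsilon_K(\mathbf{s}))$ robust for all $K \ge 1$,
then for any $\delta > 0$, with probability at least $1 - \delta$,

$$\big| \hat{l}(\mathcal{A}_{\mathbf{s}}) - l_{\mathrm{emp}}(\mathcal{A}_{\mathbf{s}}) \big| \le \inf_{K \ge 1} \left[ \epsilon_K(\mathbf{s}) + M \sqrt{\frac{2K \ln 2 + 2 \ln\frac{K(K+1)}{\delta}}{n}} \right].$$

Proof Let

$$E(K) \triangleq \left\{ \big| \hat{l}(\mathcal{A}_{\mathbf{s}}) - l_{\mathrm{emp}}(\mathcal{A}_{\mathbf{s}}) \big| > \epsilon_K(\mathbf{s}) + M \sqrt{\frac{2K \ln 2 + 2 \ln\frac{K(K+1)}{\delta}}{n}} \right\}.$$

From Theorem 3 we have $\Pr(E(K)) \le \delta/(K(K+1)) = \delta/K - \delta/(K+1)$. By the union
bound we have

$$\Pr\left\{ \bigcup_{K \ge 1} E(K) \right\} \le \sum_{K \ge 1} \Pr(E(K)) \le \sum_{K \ge 1} \left[ \frac{\delta}{K} - \frac{\delta}{K+1} \right] = \delta,$$

and the corollary follows. ∎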
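
The adaptive bound of Corollary 4, $\inf_{K \ge 1}\big[\epsilon_K(\mathbf{s}) + M\sqrt{(2K\ln 2 + 2\ln(K(K+1)/\delta))/n}\big]$, trades a finer partition (smaller $\epsilon_K$) against a larger complexity term. The sketch below searches over $K \le K_{\max}$; since restricting the range can only increase the minimum, the returned value is still a valid bound. The function name, the cap `K_max`, and the choice $\epsilon_K = 1/K$ are assumptions made for illustration.

```python
import math
from typing import Callable, Tuple


def adaptive_bound(eps_of_K: Callable[[int], float], M: float, n: int,
                   delta: float, K_max: int) -> Tuple[float, int]:
    """Minimize eps_K + M * sqrt((2K ln 2 + 2 ln(K(K+1)/delta)) / n) over 1 <= K <= K_max."""
    best_value, best_K = math.inf, 0
    for K in range(1, K_max + 1):
        penalty = math.sqrt((2 * K * math.log(2) + 2 * math.log(K * (K + 1) / delta)) / n)
        value = eps_of_K(K) + M * penalty
        if value < best_value:
            best_value, best_K = value, K
    return best_value, best_K


# Hypothetical robustness profile: eps_K = 1/K, e.g. a 1-Lipschitz loss on [0, 1]
# with the interval split into K equal pieces.
value, K_star = adaptive_bound(lambda K: 1.0 / K, M=1.0, n=10_000, delta=0.05, K_max=200)
```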

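If $\epsilon(\mathbf{s})$ does not depend on $\mathbf{s}$, we can sharpen the bound given in Corollary 4.

Corollary 5 If $\mathbf{s}$ consists of $n$ i.i.d. samples, and $\mathcal{A}$ is $(K, \epsilon_K)$ robust for all $K \ge 1$, then
for any $\delta > 0$, with probability at least $1 - \delta$,

$$\big| \hat{l}(\mathcal{A}_{\mathbf{s}}) - l_{\mathrm{emp}}(\mathcal{A}_{\mathbf{s}}) \big| \le \inf_{K \ge 1} \left[ \epsilon_K + M \sqrt{\frac{2K \ln 2 + 2 \ln\frac{1}{\delta}}{n}} \right].$$

Proof The right-hand side does not depend on $\mathbf{s}$, and hence neither does the optimal $K^*$. Therefore,
plugging $K^*$ into Theorem 3 establishes the corollary. ∎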



3.2 Quantile Loss 

So far we considered the standard expected loss setup. In this section we consider some less
extensively investigated loss functions, namely quantile value and truncated expectation
(see the following for precise definitions). These loss functions are of interest because they
are less sensitive to the presence of outliers than the standard average loss (Huber, 1981).

Definition 6 For a non-negative random variable $X$, the $\beta$-quantile value is

$$Q^{\beta}(X) \triangleq \inf\{c \in \mathbb{R} : \Pr(X \le c) \ge \beta\}.$$

The $\beta$-truncated mean is

$$T^{\beta}(X) \triangleq \begin{cases}
\mathbb{E}\big[X \cdot \mathbf{1}(X < Q^{\beta}(X))\big] & \text{if } \Pr[X = Q^{\beta}(X)] = 0; \\[4pt]
\mathbb{E}\big[X \cdot \mathbf{1}(X < Q^{\beta}(X))\big] + \big(\beta - \Pr[X < Q^{\beta}(X)]\big)\, Q^{\beta}(X) & \text{otherwise.}
\end{cases}$$

In words, the $\beta$-quantile loss is the smallest value that is larger than or equal to $X$ with
probability at least $\beta$. The $\beta$-truncated mean is the contribution to the expectation of the
leftmost $\beta$ fraction of the distribution. For example, suppose $X$ is supported on $\{c_1, \cdots, c_{10}\}$
($c_1 < c_2 < \cdots < c_{10}$) and the probability of taking each value equals 0.1. Then the
0.63-quantile loss of $X$ is $c_7$, and the 0.63-truncated mean of $X$ equals $0.1\big(\sum_{i=1}^{6} c_i + 0.3\, c_7\big)$.
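
The worked example above can be checked mechanically. The sketch below computes the $\beta$-quantile and the $\beta$-truncated mean of a finitely supported random variable with exact rational arithmetic. It assumes that an atom at the quantile contributes $(\beta - \Pr[X < Q^{\beta}(X)])\,Q^{\beta}(X)$, which is the reading that reproduces $0.1(\sum_{i=1}^{6} c_i + 0.3\,c_7)$; the function names and the choice $c_k = k$ are illustrative.

```python
from fractions import Fraction
from typing import Sequence


def quantile(values: Sequence[Fraction], probs: Sequence[Fraction], beta: Fraction) -> Fraction:
    """Smallest c with Pr(X <= c) >= beta for a finitely supported X."""
    cumulative = Fraction(0)
    for value, prob in sorted(zip(values, probs)):
        cumulative += prob
        if cumulative >= beta:
            return value
    raise ValueError("probabilities must sum to at least beta")


def truncated_mean(values: Sequence[Fraction], probs: Sequence[Fraction], beta: Fraction) -> Fraction:
    """Mass strictly below the quantile, plus the remaining beta-mass placed at the quantile."""
    q = quantile(values, probs, beta)
    mass_below = sum((p for v, p in zip(values, probs) if v < q), Fraction(0))
    mean_below = sum((v * p for v, p in zip(values, probs) if v < q), Fraction(0))
    return mean_below + (beta - mass_below) * q


# c_1 < ... < c_10 taken to be 1, ..., 10, each with probability 0.1.
values = [Fraction(k) for k in range(1, 11)]
probs = [Fraction(1, 10)] * 10
beta = Fraction(63, 100)

q = quantile(values, probs, beta)        # the 0.63-quantile
t = truncated_mean(values, probs, beta)  # the 0.63-truncated mean
```

Using `Fraction` avoids floating-point ties when the cumulative probability is compared with $\beta$.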
Given $h \in \mathcal{H}$, $\beta \in (0, 1)$, and a probability measure $\mu$ on $\mathcal{Z}$, let

$$Q(h, \beta, \mu) \triangleq Q^{\beta}(l(h, z)); \quad \text{where: } z \sim \mu;$$

and

$$T(h, \beta, \mu) \triangleq T^{\beta}(l(h, z)); \quad \text{where: } z \sim \mu;$$

i.e., the $\beta$-quantile value and $\beta$-truncated mean of the (random) testing error of hypothesis
$h$ if the testing sample follows distribution $\mu$. We have the following theorem that is a
special case of Theorem 13, hence we omit the proof.

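Theorem 7 (Quantile Value & Truncated Mean) Suppose $\mathbf{s}$ are $n$ i.i.d. samples drawn
according to $\mu$, and denote the empirical distribution of $\mathbf{s}$ by $\mu_{\mathrm{emp}}$. Let $\lambda_0 = \sqrt{\frac{2K \ln 2 + 2 \ln(1/\delta)}{n}}$.
If $0 \le \beta - \lambda_0 \le \beta + \lambda_0 \le 1$ and $\mathcal{A}$ is $(K, \epsilon(\mathbf{s}))$ robust, then with probability at least $1 - \delta$,
the following hold:

(I) $Q(\mathcal{A}_{\mathbf{s}}, \beta - \lambda_0, \mu_{\mathrm{emp}}) - \epsilon(\mathbf{s}) \le Q(\mathcal{A}_{\mathbf{s}}, \beta, \mu) \le Q(\mathcal{A}_{\mathbf{s}}, \beta + \lambda_0, \mu_{\mathrm{emp}}) + \epsilon(\mathbf{s})$;
(II) $T(\mathcal{A}_{\mathbf{s}}, \beta - \lambda_0, \mu_{\mathrm{emp}}) - \epsilon(\mathbf{s}) \le T(\mathcal{A}_{\mathbf{s}}, \beta, \mu) \le T(\mathcal{A}_{\mathbf{s}}, \beta + \lambda_0, \mu_{\mathrm{emp}}) + \epsilon(\mathbf{s})$.

In words, Theorem 7 essentially means that with high probability, the $\beta$-quantile value/truncated
mean of the testing error (recall that the testing error is a random variable) is
(approximately) bounded by the $(\beta \pm \lambda_0)$-quantile value/truncated mean of the empirical error, thus
providing a way to estimate the quantile value/truncated expectation of the testing error
based on empirical observations.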



3.3 Markovian samples 



The robustness approach is not restricted to the IID setup. In many applications of interest,
such as reinforcement learning and time series forecasting, the IID assumption is violated.
In such applications there is a time-driven process that generates samples that depend on
the previous samples (e.g., the observations of a trajectory of a robot). Such a situation can
be modeled by a stochastic process such as a Markov process. In this section we establish
a result similar to the IID case for samples that are drawn from a Markov chain. The state
space can be general, i.e., it is not necessarily finite or countable. Thus, a certain ergodic
structure of the underlying Markov chain is needed. We focus on chains that converge to
equilibrium exponentially fast and uniformly in the initial condition. It is known that this is
equivalent to the class of Doeblin chains (Meyn and Tweedie, 1993). Recall the following
definition (cf. Meyn and Tweedie, 1993; Doob, 1953).



Definition 8 A Markov chain $\{z_i\}_{i=1}^{\infty}$ on a state space $\mathcal{Z}$ is a Doeblin chain (with $\alpha$ and
$T$) if there exists a probability measure $\varphi$ on $\mathcal{Z}$, $\alpha > 0$, an integer $T \ge 1$ such that

$$\Pr(z_T \in H \mid z_0 = z) \ge \alpha \varphi(H); \quad \forall \text{ measurable } H \subseteq \mathcal{Z}; \ \forall z \in \mathcal{Z}.$$

The class of Doeblin chains is probably the "nicest" class of general state-space Markov
chains. We notice that such an assumption is not overly restrictive, since requiring that an
ergodic theorem holds for all bounded functions uniformly in the initial distribution itself
implies that a chain is Doeblin (Meyn and Tweedie, 1993). In particular, an ergodic chain
defined on a finite state-space is a Doeblin chain.

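Indeed, the Doeblin chain condition guarantees that an invariant measure $\pi$ exists.
Furthermore, we have the following lemma adapted from Theorem 2 of Glynn and Ormoneit
(2002).

Lemma 9 Let $\{z_i\}$ be a Doeblin chain as in Definition 8. Fix a function $f : \mathcal{Z} \to \mathbb{R}$ such
that $\|f\|_{\infty} \le C$. Then for $n > 2CT/\epsilon\alpha$ the following holds

$$\Pr\left( \frac{1}{n} \sum_{i=1}^{n} f(z_i) - \int_{\mathcal{Z}} f(z)\, \pi(dz) \ge \epsilon \right) \le \exp\left( -\frac{\alpha^2 (n\epsilon - 2CT/\alpha)^2}{2nC^2T^2} \right).$$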



The following is the main theorem of this section that establishes a generalization bound 
for robust algorithms with samples drawn according to a Doeblin chain. 



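Theorem 10 Let $\mathbf{s} = \{s_1, \cdots, s_n\}$ be the first $n$ outputs of a Doeblin chain with $\alpha$ and
$T$ such that $n > 2T/\alpha$, and suppose that $\mathcal{A}$ is $(K, \epsilon(\mathbf{s}))$-robust. Then for any $\delta > 0$, with
probability at least $1 - \delta$,

$$\big| \hat{l}(\mathcal{A}_{\mathbf{s}}) - l_{\mathrm{emp}}(\mathcal{A}_{\mathbf{s}}) \big| \le \epsilon(\mathbf{s}) + M \left( \frac{8T^2 (K \ln 2 + \ln(1/\delta))}{\alpha^2 n} \right)^{1/4}.$$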

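Proof We prove the following slightly stronger statement:

$$\big| \hat{l}(\mathcal{A}_{\mathbf{s}}) - l_{\mathrm{emp}}(\mathcal{A}_{\mathbf{s}}) \big| \le \epsilon(\mathbf{s}) + M \sqrt{\frac{T}{\alpha n}} \sqrt{\sqrt{2n(K \ln 2 + \ln(1/\delta))} + 2}. \qquad (5)$$

Let $\lambda_0 = \sqrt{\frac{T}{\alpha n}} \sqrt{\sqrt{2n(K \ln 2 + \ln(1/\delta))} + 2}$; we have that $\lambda_0 \ge \sqrt{2T/\alpha n}$. Since $n > 2T/\alpha$,
we have $n > \sqrt{2Tn/\alpha}$, which leads to

$$n > \frac{2T}{\alpha \sqrt{2T/\alpha n}} \ge \frac{2T}{\alpha \lambda_0}.$$

Let $N_i$ be the set of indices of points of $\mathbf{s}$ that fall into $C_i$. Consider the set of functions
$\mathcal{H} = \{\mathbf{1}(x \in H) \mid H = \bigcup_{i \in I} C_i; \ \forall I \subseteq \{1, \cdots, K\}\}$, i.e., the set of indicator functions of all
different unions of $C_i$. Then $|\mathcal{H}| = 2^K$. Furthermore, fix an $h_0 \in \mathcal{H}$,

$$\begin{aligned}
\Pr\left( \sum_{i=1}^{K} \left| \frac{|N_i|}{n} - \pi(C_i) \right| \ge \lambda \right)
&= \Pr\left\{ \sup_{h \in \mathcal{H}} \left[ \frac{1}{n} \sum_{i=1}^{n} h(s_i) - \mathbb{E}_{\pi} h(s) \right] \ge \lambda \right\} \\
&\le 2^K \Pr\left[ \frac{1}{n} \sum_{i=1}^{n} h_0(s_i) - \mathbb{E}_{\pi} h_0(s) \ge \lambda \right].
\end{aligned}$$

Since $\|h_0\|_{\infty} = 1$, we can apply Lemma 9 to get for $n > 2T/\lambda\alpha$

$$\Pr\left[ \frac{1}{n} \sum_{i=1}^{n} h_0(s_i) - \mathbb{E}_{\pi} h_0(s) \ge \lambda \right] \le \exp\left( -\frac{\alpha^2 (n\lambda - 2T/\alpha)^2}{2nT^2} \right).$$

Substitute in $\lambda_0$,

$$\Pr\left( \sum_{i=1}^{K} \left| \frac{|N_i|}{n} - \pi(C_i) \right| \ge \lambda_0 \right) \le 2^K \exp\left( -\frac{\alpha^2 (n\lambda_0 - 2T/\alpha)^2}{2nT^2} \right) \le \delta.$$

Thus, (5) follows by an argument identical to the proof of Theorem 3.

To complete the proof of the theorem, note that $n > 2T/\alpha$ implies $n > 2$, hence
$\sqrt{2n(K \ln 2 + \ln(1/\delta))} \ge 2$. Therefore,

$$\sqrt{\frac{T}{\alpha n}} \sqrt{\sqrt{2n(K \ln 2 + \ln(1/\delta))} + 2} \le \sqrt{\frac{T}{\alpha n}} \sqrt{2\sqrt{2n(K \ln 2 + \ln(1/\delta))}} = \left( \frac{8T^2 (K \ln 2 + \ln(1/\delta))}{\alpha^2 n} \right)^{1/4},$$

and the theorem follows. ∎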
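
For comparison with the IID case, the sketch below evaluates the Theorem 10 bound $\epsilon(\mathbf{s}) + M\big(8T^2(K\ln 2 + \ln(1/\delta))/(\alpha^2 n)\big)^{1/4}$ together with the intermediate form $\epsilon(\mathbf{s}) + M\sqrt{T/(\alpha n)}\sqrt{\sqrt{2n(K\ln 2 + \ln(1/\delta))} + 2}$ used in its proof. The function names and parameter values are assumptions made for illustration; both functions enforce the requirement $n > 2T/\alpha$.

```python
import math


def _complexity(K: int, delta: float) -> float:
    """The recurring term K ln 2 + ln(1/delta)."""
    return K * math.log(2) + math.log(1 / delta)


def markov_bound(eps: float, M: float, K: int, n: int, delta: float,
                 T: int, alpha: float) -> float:
    """Theorem 10 form: eps + M * (8 T^2 (K ln 2 + ln(1/delta)) / (alpha^2 n))^(1/4)."""
    if n <= 2 * T / alpha:
        raise ValueError("the bound requires n > 2T/alpha")
    return eps + M * (8 * T ** 2 * _complexity(K, delta) / (alpha ** 2 * n)) ** 0.25


def markov_bound_intermediate(eps: float, M: float, K: int, n: int, delta: float,
                              T: int, alpha: float) -> float:
    """Intermediate form: eps + M * sqrt(T/(alpha n)) * sqrt(sqrt(2n(K ln 2 + ln(1/delta))) + 2)."""
    if n <= 2 * T / alpha:
        raise ValueError("the bound requires n > 2T/alpha")
    inner = math.sqrt(2 * n * _complexity(K, delta)) + 2
    return eps + M * math.sqrt(T / (alpha * n)) * math.sqrt(inner)


# Illustrative Doeblin parameters T = 1, alpha = 0.5.
params = dict(eps=0.1, M=1.0, K=10, n=10_000, delta=0.05, T=1, alpha=0.5)
weak = markov_bound(**params)
strong = markov_bound_intermediate(**params)
```

The deviation term here decays like $n^{-1/4}$ rather than the $n^{-1/2}$ of the IID bound, which reflects the dependence between consecutive samples.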






4. Pseudo Robustness 

In this section we propose a relaxed definition of robustness that accounts for the case
where Equation (2) holds for most of the training samples, as opposed to Definition 2 where
Equation (2) holds for all training samples. Recall that the size of the training set is fixed as
$n$.

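Definition 11 Algorithm $\mathcal{A}$ is $(K, \epsilon(\mathbf{s}), \hat{n})$ pseudo robust if $\mathcal{Z}$ can be partitioned into $K$
disjoint sets, denoted as $\{C_i\}_{i=1}^{K}$, and there exists a subset of training samples $\hat{\mathbf{s}}$ with $|\hat{\mathbf{s}}| = \hat{n}$ such that
$\forall s \in \hat{\mathbf{s}}$,

$$s, z \in C_i \;\Longrightarrow\; |l(\mathcal{A}_{\mathbf{s}}, s) - l(\mathcal{A}_{\mathbf{s}}, z)| \le \epsilon(\mathbf{s}).$$

Observe that $(K, \epsilon(\mathbf{s}))$-robust is equivalent to $(K, \epsilon(\mathbf{s}), n)$ pseudo robust.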

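Theorem 12 If $\mathbf{s}$ consists of $n$ i.i.d. samples, and $\mathcal{A}$ is $(K, \epsilon(\mathbf{s}), \hat{n})$ pseudo robust, then for
any $\delta > 0$, with probability at least $1 - \delta$,

$$\big| \hat{l}(\mathcal{A}_{\mathbf{s}}) - l_{\mathrm{emp}}(\mathcal{A}_{\mathbf{s}}) \big| \le \frac{\hat{n}}{n} \epsilon(\mathbf{s}) + M \left( \frac{n - \hat{n}}{n} + \sqrt{\frac{2K \ln 2 + 2 \ln(1/\delta)}{n}} \right).$$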



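Proof Let $N_i$ and $\hat{N}_i$ be the sets of indices of points of $\mathbf{s}$ and $\hat{\mathbf{s}}$ that fall into $C_i$,
respectively. Similarly to the proof of Theorem 3, we note that $(|N_1|, \cdots, |N_K|)$ is an IID
multinomial random variable with parameters $n$ and $(\mu(C_1), \cdots, \mu(C_K))$. Hence, due
to the Bretagnolle-Huber-Carol inequality, the following holds with probability at least $1 - \delta$,

$$\sum_{i=1}^{K} \left| \frac{|N_i|}{n} - \mu(C_i) \right| \le \sqrt{\frac{2K \ln 2 + 2 \ln(1/\delta)}{n}}. \qquad (6)$$

Furthermore, we have

$$\begin{aligned}
\big| \hat{l}(\mathcal{A}_{\mathbf{s}}) - l_{\mathrm{emp}}(\mathcal{A}_{\mathbf{s}}) \big|
&= \left| \sum_{i=1}^{K} \mathbb{E}\big(l(\mathcal{A}_{\mathbf{s}}, z) \mid z \in C_i\big)\, \mu(C_i) - \frac{1}{n} \sum_{i=1}^{n} l(\mathcal{A}_{\mathbf{s}}, s_i) \right| \\
&\le \left| \sum_{i=1}^{K} \mathbb{E}\big(l(\mathcal{A}_{\mathbf{s}}, z) \mid z \in C_i\big) \frac{|N_i|}{n} - \frac{1}{n} \sum_{i=1}^{n} l(\mathcal{A}_{\mathbf{s}}, s_i) \right| \\
&\qquad + \left| \sum_{i=1}^{K} \mathbb{E}\big(l(\mathcal{A}_{\mathbf{s}}, z) \mid z \in C_i\big)\, \mu(C_i) - \sum_{i=1}^{K} \mathbb{E}\big(l(\mathcal{A}_{\mathbf{s}}, z) \mid z \in C_i\big) \frac{|N_i|}{n} \right| \\
&\le \left| \frac{1}{n} \sum_{i=1}^{K} \Big[ |N_i| \times \mathbb{E}\big(l(\mathcal{A}_{\mathbf{s}}, z) \mid z \in C_i\big) - \sum_{j \in \hat{N}_i} l(\mathcal{A}_{\mathbf{s}}, s_j) - \sum_{j \in N_i,\, j \notin \hat{N}_i} l(\mathcal{A}_{\mathbf{s}}, s_j) \Big] \right| \\
&\qquad + \left| \max_{z \in \mathcal{Z}} |l(\mathcal{A}_{\mathbf{s}}, z)| \sum_{i=1}^{K} \left| \frac{|N_i|}{n} - \mu(C_i) \right| \right|.
\end{aligned}$$

Note that due to the triangle inequality as well as the assumption that the loss is
non-negative and upper bounded by $M$, the right-hand side can be upper bounded by

$$\begin{aligned}
&\left| \frac{1}{n} \sum_{i=1}^{K} \sum_{j \in \hat{N}_i} \max_{z_2 \in C_i} \big| l(\mathcal{A}_{\mathbf{s}}, s_j) - l(\mathcal{A}_{\mathbf{s}}, z_2) \big| \right|
+ \left| \frac{1}{n} \sum_{i=1}^{K} \sum_{j \in N_i,\, j \notin \hat{N}_i} \max_{z_2 \in C_i} \big| l(\mathcal{A}_{\mathbf{s}}, s_j) - l(\mathcal{A}_{\mathbf{s}}, z_2) \big| \right|
+ M \sum_{i=1}^{K} \left| \frac{|N_i|}{n} - \mu(C_i) \right| \\
&\qquad \le \frac{\hat{n}}{n} \epsilon(\mathbf{s}) + \frac{n - \hat{n}}{n} M + M \sum_{i=1}^{K} \left| \frac{|N_i|}{n} - \mu(C_i) \right|,
\end{aligned}$$

where the inequality holds due to the definition of $N_i$ and $\hat{N}_i$. The theorem follows by
applying (6). ∎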
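
The pseudo-robust bound of Theorem 12, $\frac{\hat{n}}{n}\epsilon(\mathbf{s}) + M\big(\frac{n-\hat{n}}{n} + \sqrt{(2K\ln 2 + 2\ln(1/\delta))/n}\big)$, can be evaluated in the same way as the bound for robust algorithms. In the sketch below, `n_hat` stands for $\hat{n}$, the number of training samples on which the robustness condition holds; the function name and numbers are illustrative assumptions.

```python
import math


def pseudo_robust_bound(eps: float, M: float, K: int, n: int, n_hat: int, delta: float) -> float:
    """(n_hat/n) * eps + M * ((n - n_hat)/n + sqrt((2K ln 2 + 2 ln(1/delta)) / n))."""
    if not 0 <= n_hat <= n:
        raise ValueError("n_hat must lie between 0 and n")
    deviation = math.sqrt((2 * K * math.log(2) + 2 * math.log(1 / delta)) / n)
    return (n_hat / n) * eps + M * ((n - n_hat) / n + deviation)


# With n_hat = n the expression reduces to the bound for (K, eps)-robust algorithms.
fully_robust = pseudo_robust_bound(eps=0.1, M=1.0, K=10, n=10_000, n_hat=10_000, delta=0.05)
# If 5% of the training samples violate the condition, each is charged the worst-case loss M.
mostly_robust = pseudo_robust_bound(eps=0.1, M=1.0, K=10, n=10_000, n_hat=9_500, delta=0.05)
```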



Similarly, Theorem 7 can be generalized to the pseudo robust case. The proof is lengthy
and hence postponed to Appendix A.1.

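Theorem 13 (Quantile Value & Truncated Expectation) Suppose $\mathbf{s}$ has $n$ samples
drawn i.i.d. according to $\mu$, and denote the empirical distribution of $\mathbf{s}$ as $\mu_{\mathrm{emp}}$. Let
$\lambda_0 = \sqrt{\frac{2K \ln 2 + 2 \ln(1/\delta)}{n}}$. Suppose $0 \le \beta - \lambda_0 - (n - \hat{n})/n \le \beta + \lambda_0 + (n - \hat{n})/n \le 1$ and $\mathcal{A}$ is
$(K, \epsilon(\mathbf{s}), \hat{n})$ pseudo robust. Then with probability at least $1 - \delta$, the following hold:

$$\text{(I)} \quad Q\Big(\mathcal{A}_{\mathbf{s}}, \beta - \lambda_0 - \frac{n - \hat{n}}{n}, \mu_{\mathrm{emp}}\Big) - \epsilon(\mathbf{s}) \le Q(\mathcal{A}_{\mathbf{s}}, \beta, \mu) \le Q\Big(\mathcal{A}_{\mathbf{s}}, \beta + \lambda_0 + \frac{n - \hat{n}}{n}, \mu_{\mathrm{emp}}\Big) + \epsilon(\mathbf{s});$$

$$\text{(II)} \quad T\Big(\mathcal{A}_{\mathbf{s}}, \beta - \lambda_0 - \frac{n - \hat{n}}{n}, \mu_{\mathrm{emp}}\Big) - \epsilon(\mathbf{s}) \le T(\mathcal{A}_{\mathbf{s}}, \beta, \mu) \le T\Big(\mathcal{A}_{\mathbf{s}}, \beta + \lambda_0 + \frac{n - \hat{n}}{n}, \mu_{\mathrm{emp}}\Big) + \epsilon(\mathbf{s}).$$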



5. Examples of Robust Algorithms 

In this section we provide some examples of robust algorithms. The proofs of the
examples can be found in the Appendix. Our first example is Majority Voting (MV) classification
(cf. Section 6.3 of Devroye et al., 1996) that partitions the input space $\mathcal{X}$ and labels each
partition set according to a majority vote of the training samples belonging to it.



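Example 3 (Majority Voting) Let $\mathcal{Y} = \{-1, +1\}$. Partition $\mathcal{X}$ into $C_1, \cdots, C_K$, and use
$C(x)$ to denote the set to which $x$ belongs. A new sample $x_a \in \mathcal{X}$ is labeled by

$$\mathcal{A}_{\mathbf{s}}(x_a) \triangleq \begin{cases}
1, & \text{if } \sum_{s_{j|x} \in C(x_a)} \mathbf{1}(s_{j|y} = 1) \ge \sum_{s_{j|x} \in C(x_a)} \mathbf{1}(s_{j|y} = -1); \\
-1, & \text{otherwise.}
\end{cases}$$

If the loss function is $l(\mathcal{A}_{\mathbf{s}}, z) = f(z_{|y}, \mathcal{A}_{\mathbf{s}}(z_{|x}))$ for some function $f$, then MV is $(2K, 0)$
robust.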
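
A minimal sketch of majority voting on a fixed partition is given below. It assumes a one-dimensional input space $[0, 1]$ split into equal intervals and breaks ties (including empty cells) in favor of $+1$; the names `train_majority_vote` and `interval_cell` are hypothetical. Since the prediction depends on $x$ only through its cell, any loss of the form $f(z_{|y}, \mathcal{A}_{\mathbf{s}}(z_{|x}))$ is constant on each (cell, label) pair, which is the source of the $(2K, 0)$ robustness.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Sequence, Tuple


def train_majority_vote(train: Sequence[Tuple[float, int]],
                        cell: Callable[[float], Hashable]) -> Callable[[float], int]:
    """Return a predictor that labels each cell by the majority training label in it."""
    votes: Dict[Hashable, int] = defaultdict(int)
    for x, y in train:
        votes[cell(x)] += y  # y is -1 or +1, so the sign of the sum is the majority

    def predict(x: float) -> int:
        # Ties and empty cells default to +1 (an assumption of this sketch).
        return 1 if votes.get(cell(x), 0) >= 0 else -1

    return predict


def interval_cell(x: float, num_cells: int = 4) -> int:
    """Index of the equal-width interval of [0, 1] containing x."""
    return min(int(x * num_cells), num_cells - 1)


train = [(0.05, 1), (0.10, 1), (0.20, -1), (0.30, -1), (0.40, -1), (0.80, 1)]
predict = train_majority_vote(train, interval_cell)
```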

MV algorithm has a natural partition of the sample space that makes it robust. Another 
class of robust algorithms are those that have approximately the same testing loss for testing 
samples that are close (in the sense of geometric distance) to each other, since we can 
partition the sample space with norm balls. The next theorem states that an algorithm is 
robust if two samples being close implies that they have similar testing error. 

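Theorem 14 Fix $\gamma > 0$ and a metric $\rho$ of $\mathcal{Z}$. Suppose $\mathcal{A}$ satisfies

$$|l(\mathcal{A}_{\mathbf{s}}, z_1) - l(\mathcal{A}_{\mathbf{s}}, z_2)| \le \epsilon(\mathbf{s}), \quad \forall z_1, z_2 : z_1 \in \mathbf{s}, \ \rho(z_1, z_2) \le \gamma,$$

and $\mathcal{N}(\gamma/2, \mathcal{Z}, \rho) < \infty$. Then $\mathcal{A}$ is $\big(\mathcal{N}(\gamma/2, \mathcal{Z}, \rho), \epsilon(\mathbf{s})\big)$-robust.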

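Proof Let $\{c_1, \cdots, c_{\mathcal{N}(\gamma/2, \mathcal{Z}, \rho)}\}$ be a $\gamma/2$-cover of $\mathcal{Z}$, whose existence is guaranteed by the
definition of covering number. Let $\hat{C}_i = \{z \in \mathcal{Z} \mid \rho(z, c_i) \le \gamma/2\}$, and $C_i = \hat{C}_i \cap \big(\bigcup_{j=1}^{i-1} \hat{C}_j\big)^c$.
Thus, $C_1, \cdots, C_{\mathcal{N}(\gamma/2, \mathcal{Z}, \rho)}$ is a partition of $\mathcal{Z}$, and satisfies

$$z_1, z_2 \in C_i \;\Longrightarrow\; \rho(z_1, z_2) \le \rho(z_1, c_i) + \rho(z_2, c_i) \le \gamma.$$

Therefore,

$$|l(\mathcal{A}_{\mathbf{s}}, z_1) - l(\mathcal{A}_{\mathbf{s}}, z_2)| \le \epsilon(\mathbf{s}), \quad \forall z_1, z_2 : z_1 \in \mathbf{s}, \ \rho(z_1, z_2) \le \gamma,$$

implies

$$z_1 \in \mathbf{s}, \ z_1, z_2 \in C_i \;\Longrightarrow\; |l(\mathcal{A}_{\mathbf{s}}, z_1) - l(\mathcal{A}_{\mathbf{s}}, z_2)| \le \epsilon(\mathbf{s}),$$

and the theorem follows. ∎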
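
The partition used in this proof can be sketched in a few lines: a point is assigned to the first cover center within distance $\gamma/2$, which is the same as removing from each ball the balls of all earlier centers. The names and the one-dimensional data below are illustrative assumptions.

```python
from typing import Callable, List, Sequence


def cell_index(z: float, centers: Sequence[float], gamma: float,
               rho: Callable[[float, float], float]) -> int:
    """Index i of the first center c_i with rho(z, c_i) <= gamma / 2."""
    for i, c in enumerate(centers):
        if rho(z, c) <= gamma / 2:
            return i
    raise ValueError("centers do not form a gamma/2-cover of this point")


def abs_distance(a: float, b: float) -> float:
    """Metric on the real line."""
    return abs(a - b)


centers = [0.1, 0.3, 0.5, 0.7, 0.9]  # a 0.125-cover of [0, 1]
gamma = 0.25
points = [i / 50 for i in range(51)]
cells: List[int] = [cell_index(z, centers, gamma, abs_distance) for z in points]
```

Two points that receive the same index are both within $\gamma/2$ of the same center, so by the triangle inequality they are at most $\gamma$ apart, which is exactly the property the proof relies on.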

Theorem 14 immediately leads to the next example: if the testing error given the output of
an algorithm is Lipschitz continuous, then the algorithm is robust.

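Example 4 (Lipschitz continuous functions) If $\mathcal{Z}$ is compact w.r.t. metric $\rho$, and $l(\mathcal{A}_{\mathbf{s}}, \cdot)$
is Lipschitz continuous with Lipschitz constant $c(\mathbf{s})$, i.e.,

$$|l(\mathcal{A}_{\mathbf{s}}, z_1) - l(\mathcal{A}_{\mathbf{s}}, z_2)| \le c(\mathbf{s})\, \rho(z_1, z_2), \quad \forall z_1, z_2 \in \mathcal{Z},$$

then $\mathcal{A}$ is $\big(\mathcal{N}(\gamma/2, \mathcal{Z}, \rho), c(\mathbf{s})\gamma\big)$-robust for all $\gamma > 0$.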

Theorem 14 also implies that SVM, Lasso, feed-forward neural networks and PCA are
robust, as stated in Example 5 to Example 8. The proofs are deferred to Appendix A.3
to A.6.

Example 5 (Support Vector Machine) Let $\mathcal{X}$ be compact. Consider the standard SVM
formulation (Cortes and Vapnik, 1995; Schölkopf and Smola, 2002)

$$\begin{aligned}
\text{Minimize}_{w, d}: \quad & c\|w\|_{\mathcal{H}}^2 + \frac{1}{n} \sum_{i=1}^{n} \xi_i \\
\text{s.t.} \quad & 1 - s_{i|y}\big[\langle w, \phi(s_{i|x}) \rangle + d\big] \le \xi_i; \\
& \xi_i \ge 0.
\end{aligned}$$

Here $\phi(\cdot)$ is a feature mapping, $\|\cdot\|_{\mathcal{H}}$ is its RKHS norm, and $k(\cdot, \cdot)$ is the kernel function.
Let $l(\cdot, \cdot)$ be the hinge loss, i.e., $l((w, d), z) = [1 - z_{|y}(\langle w, \phi(z_{|x}) \rangle + d)]^+$, and define
$f_{\mathcal{H}}(\gamma) \triangleq \max_{a, b \in \mathcal{X},\, \|a - b\|_2 \le \gamma} \big(k(a, a) + k(b, b) - 2k(a, b)\big)$. If $k(\cdot, \cdot)$ is continuous, then for any $\gamma > 0$,
$f_{\mathcal{H}}(\gamma)$ is finite, and SVM is $\big(2\mathcal{N}(\gamma/2, \mathcal{X}, \|\cdot\|_2), \sqrt{f_{\mathcal{H}}(\gamma)/c}\big)$ robust.

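Example 6 (Lasso) Let $\mathcal{Z}$ be compact and the loss function be $l(\mathcal{A}_{\mathbf{s}}, z) = |z_{|y} - \mathcal{A}_{\mathbf{s}}(z_{|x})|$.
Lasso (Tibshirani, 1996), which is the following regression formulation:

$$\min_{w} \; \frac{1}{n} \sum_{i=1}^{n} (s_{i|y} - w^\top s_{i|x})^2 + c\|w\|_1, \qquad (7)$$

is $\big(\mathcal{N}(\gamma/2, \mathcal{Z}, \|\cdot\|_{\infty}), (Y(\mathbf{s})/c + 1)\gamma\big)$-robust for all $\gamma > 0$, where $Y(\mathbf{s}) \triangleq \frac{1}{n} \sum_{i=1}^{n} s_{i|y}^2$.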

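Example 7 (Feed-forward Neural Networks) Let $\mathcal{Z}$ be compact and the loss function
be $l(\mathcal{A}_{\mathbf{s}}, z) = |z_{|y} - \mathcal{A}_{\mathbf{s}}(z_{|x})|$. Consider the $d$-layer neural network (trained on $\mathbf{s}$), which is
the following predicting rule given an input $x \in \mathcal{X}$

$$\begin{aligned}
\forall v = 1, \cdots, d - 1: \quad & x_i^{v} := \sigma\Big( \sum_{j=1}^{N_{v-1}} w_{ij}^{v-1} x_j^{v-1} \Big); \quad i = 1, \cdots, N_v; \\
& \mathcal{A}_{\mathbf{s}}(x) := \sigma\Big( \sum_{i=1}^{N_{d-1}} w_i^{d-1} x_i^{d-1} \Big).
\end{aligned}$$

If there exist $\alpha, \beta$ such that the $d$-layer neural network satisfies $|\sigma(a) - \sigma(b)| \le \beta|a - b|$
and $\sum_{j=1}^{N_v} |w_{ij}^{v}| \le \alpha$ for all $v, i$, then it is $\big(\mathcal{N}(\gamma/2, \mathcal{Z}, \|\cdot\|_{\infty}), \alpha^d \beta^d \gamma\big)$-robust, for all $\gamma > 0$.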

We remark that in Example 7, the number of hidden units in each layer has no effect on the
robustness of the algorithm and consequently the bound on the testing error. This indeed
agrees with Bartlett (1998), where the author showed (using a different approach based on
fat-shattering dimension) that for neural networks, the weight plays a more important role
than the number of hidden units.

The next example considers an unsupervised learning algorithm, namely the principal 
component analysis. We show that it is robust if the sample space is bounded. Note that, 
this does not contradict with the well known fact that the principal component analysis is 
sensitive to outliers which are far away from the origin. 

Example 8 (Principal Component Analysis (PCA)) Let $\mathcal{Z} \subset \mathbb{R}^m$, such that $\max_{z \in \mathcal{Z}} \|z\|_2 \le B$.
If the loss function is $l((w_1, \cdots, w_d), z) = \sum_{k=1}^{d} (w_k^\top z)^2$, then finding the first $d$ principal
components, which solves the following optimization problem over $w_1, \cdots, w_d \in \mathbb{R}^m$,

$$\begin{aligned}
\text{Maximize:} \quad & \sum_{i=1}^{n} \sum_{k=1}^{d} (w_k^\top s_i)^2 \\
\text{Subject to:} \quad & \|w_k\|_2 = 1, \quad k = 1, \cdots, d; \\
& w_i^\top w_j = 0, \quad i \neq j,
\end{aligned}$$

is $\big(\mathcal{N}(\gamma/2, \mathcal{Z}, \|\cdot\|_2), 2d\gamma B\big)$-robust.

The last example is large-margin classification, which is a generalization of Example 1.
We need the following standard definition (e.g., Bartlett, 1998) of the distance of a point
to a classification rule.

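Definition 15 Fix a metric $\rho$ of $\mathcal{X}$. Given a classification rule $\Delta$ and $x \in \mathcal{X}$, the distance
of $x$ to $\Delta$ is

$$\mathcal{D}(x, \Delta) \triangleq \inf\{c \ge 0 \mid \exists x' \in \mathcal{X} : \rho(x, x') \le c, \ \Delta(x) \neq \Delta(x')\}.$$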

A large margin classifier is a classification rule such that most of the training samples 
are "far away" from the classification boundary. 

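Example 9 (Large-margin classifier) If there exist $\gamma$ and $\hat{n}$ such that

$$\sum_{j=1}^{n} \mathbf{1}\big(\mathcal{D}(s_{j|x}, \mathcal{A}_{\mathbf{s}}) > \gamma\big) \ge \hat{n},$$

then algorithm $\mathcal{A}$ is $\big(2\mathcal{N}(\gamma/2, \mathcal{X}, \rho), 0, \hat{n}\big)$ pseudo robust, provided that $\mathcal{N}(\gamma/2, \mathcal{X}, \rho) < \infty$.
Note that if we take $\rho$ to be the Euclidean norm, and let $\hat{n} = n$, then we recover Example 1.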
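
To illustrate how $\hat{n}$ is obtained in the large-margin example, consider the simplest possible setting, which is an assumption of this sketch and not part of the paper: inputs on the real line with $\rho(x, x') = |x - x'|$ and a threshold rule that predicts $+1$ iff $x \ge t$. For such a rule the distance of $x$ to the classifier is $|x - t|$, so $\hat{n}$ is the number of training inputs whose distance to the threshold exceeds $\gamma$.

```python
from typing import Sequence


def count_large_margin(train_x: Sequence[float], threshold: float, gamma: float) -> int:
    """Number of training inputs whose distance to the threshold rule exceeds gamma."""
    return sum(1 for x in train_x if abs(x - threshold) > gamma)


train_x = [0.10, 0.45, 0.52, 0.90]
n_hat = count_large_margin(train_x, threshold=0.5, gamma=0.1)
```

Here two of the four inputs lie within 0.1 of the threshold, so only the remaining two count toward $\hat{n}$; a smaller $\gamma$ raises $\hat{n}$ at the price of a larger covering number.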

6. Necessity of Robustness 

Thus far we have considered finite-sample generalization bounds of robust algorithms. We
now turn to asymptotic analysis, i.e., we are given an increasing set of training samples
$s = (s_1, s_2, \cdots)$ and tested on an increasing set of testing samples $t = (t_1, t_2, \cdots)$. We
use $s(n)$ and $t(n)$ to denote the first $n$ elements of the training samples and testing samples,
respectively. For succinctness, we let $\mathcal{L}(\cdot, \cdot)$ be the average loss given a set of samples,
i.e., for $h \in \mathcal{H}$,

\[ \mathcal{L}(h, t(n)) \equiv \frac{1}{n} \sum_{i=1}^n l(h, t_i). \]
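
The average loss over the first n samples is just an arithmetic mean of per-sample losses. A minimal sketch follows; the function name `average_loss`, the scalar hypothesis, and the absolute loss are illustrative assumptions rather than anything fixed by the text.

```python
from typing import Callable, Sequence, TypeVar

H = TypeVar("H")
Z = TypeVar("Z")

def average_loss(loss: Callable[[H, Z], float], h: H, samples: Sequence[Z]) -> float:
    """Average of loss(h, z) over the given samples."""
    if len(samples) == 0:
        raise ValueError("need at least one sample")
    return sum(loss(h, z) for z in samples) / len(samples)

# Hypothetical setup: h is a scalar slope, z = (x, y), and the loss is |y - h * x|.
def abs_loss(h: float, z: tuple) -> float:
    x, y = z
    return abs(y - h * x)

t = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0)]
print(average_loss(abs_loss, 2.0, t))      # losses 0, 1, 1 averaged over n = 3
print(average_loss(abs_loss, 2.0, t[:2]))  # same hypothesis on the first n = 2 samples
```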

We show in this section that robustness is an essential property of successful learning. 
In particular, a (weaker) notion of robustness characterizes generalizability, i.e., a learning 
algorithm generalizes if and only if it is weakly robust. To make this precise, we define the 
notion of generalizability and weak robustness first. 

Definition 16 1. A learning algorithm $A$ generalizes w.r.t. $s$ if

\[ \limsup_n \Big\{ \mathbb{E}_t\big(l(A_{s(n)}, t)\big) - \mathcal{L}(A_{s(n)}, s(n)) \Big\} \le 0. \]

2. A learning algorithm $A$ generalizes w.p. 1 if it generalizes w.r.t. almost every $s$.

We remark that the proposed notion of generalizability differs slightly from the standard one
in the sense that the latter requires that the empirical risk and the expected risk converge
in mean, while the proposed notion requires convergence w.p. 1. It is straightforward that
the proposed notion implies the standard one.

Definition 17 1. A learning algorithm $A$ is weakly robust w.r.t. $s$ if there exists a
sequence of $\{\mathcal{D}_n \subseteq \mathcal{Z}^n\}$ such that $\Pr(t(n) \in \mathcal{D}_n) \to 1$, and

\[ \limsup_n \Big\{ \max_{\hat{s}(n) \in \mathcal{D}_n} \big[ \mathcal{L}(A_{s(n)}, \hat{s}(n)) - \mathcal{L}(A_{s(n)}, s(n)) \big] \Big\} \le 0. \]

2. A learning algorithm $A$ is a.s. weakly robust if it is weakly robust w.r.t. almost every $s$.

We briefly comment on the definition of weak robustness. Recall that the definition of
robustness requires that the sample space can be partitioned into disjoint subsets such that
if a testing sample belongs to the same partitioning set as a training sample, then they
have similar loss. Weak robustness generalizes this notion by considering the average loss
of testing samples and training samples. That is, if for a large (in the probabilistic sense)
subset of $\mathcal{Z}^n$, the testing error is close to the training error, then the algorithm is weakly
robust. It is easy to see, by the Bretagnolle-Huber-Carol lemma, that if for any fixed $\epsilon > 0$
there exists $K$ such that $A$ is $(K, \epsilon)$ robust, then $A$ is weakly robust.

We now establish the main result of this section: weak robustness and generalizability 
are equivalent. 

Theorem 18 An algorithm A generalizes w.r.t. s if and only if it is weakly robust w.r.t. s. 

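Proof We prove the sufficiency of weak robustness first. When $A$ is weakly robust w.r.t.
$s$, by definition there exists $\{\mathcal{D}_n\}$ such that for any $\delta, \epsilon > 0$, there exists $N(\delta, \epsilon)$ such that
for all $n > N(\delta, \epsilon)$, $\Pr(t(n) \in \mathcal{D}_n) > 1 - \delta$, and

\[ \sup_{\hat{s}(n) \in \mathcal{D}_n} \mathcal{L}(A_{s(n)}, \hat{s}(n)) - \mathcal{L}(A_{s(n)}, s(n)) < \epsilon. \tag{8} \]

Therefore, the following holds for any $n > N(\delta, \epsilon)$,

\[
\begin{aligned}
& \mathbb{E}_t\big(l(A_{s(n)}, t)\big) - \mathcal{L}(A_{s(n)}, s(n)) \\
=\ & \mathbb{E}_{t(n)}\big(\mathcal{L}(A_{s(n)}, t(n))\big) - \mathcal{L}(A_{s(n)}, s(n)) \\
=\ & \Pr(t(n) \notin \mathcal{D}_n)\, \mathbb{E}\big(\mathcal{L}(A_{s(n)}, t(n)) \mid t(n) \notin \mathcal{D}_n\big)
   + \Pr(t(n) \in \mathcal{D}_n)\, \mathbb{E}\big(\mathcal{L}(A_{s(n)}, t(n)) \mid t(n) \in \mathcal{D}_n\big) \\
 & - \mathcal{L}(A_{s(n)}, s(n)) \\
\le\ & \delta M + \sup_{\hat{s}(n) \in \mathcal{D}_n} \big\{ \mathcal{L}(A_{s(n)}, \hat{s}(n)) - \mathcal{L}(A_{s(n)}, s(n)) \big\} \le \delta M + \epsilon.
\end{aligned}
\]

Here, the first equality holds because the samples in $t(n)$ are i.i.d., and the second equality
holds by conditioning on whether $t(n) \in \mathcal{D}_n$. The inequalities hold due to the assumption that
the loss function is upper bounded by $M$, as well as (8).

We thus conclude that the algorithm $A$ generalizes for $s$, because $\epsilon, \delta$ can be arbitrary.
Now we turn to the necessity of weak robustness. First, we establish the following
lemma.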

Lemma 19 Given $s$, if algorithm $A$ is not weakly robust w.r.t. $s$, then there exist $\epsilon^*, \delta^* > 0$
such that the following holds for infinitely many $n$,

\[ \Pr\big(\mathcal{L}(A_{s(n)}, t(n)) \ge \mathcal{L}(A_{s(n)}, s(n)) + \epsilon^*\big) \ge \delta^*. \tag{9} \]

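Proof We prove the lemma by contradiction. Assume that such $\epsilon^*$ and $\delta^*$ do not exist.
Let $\epsilon_v = \delta_v = 1/v$ for $v = 1, 2, \cdots$; then there exists a non-decreasing sequence $\{N(v)\}_{v=1}^\infty$
such that for all $v$, if $n \ge N(v)$ then $\Pr\big(\mathcal{L}(A_{s(n)}, t(n)) \ge \mathcal{L}(A_{s(n)}, s(n)) + \epsilon_v\big) < \delta_v$. For
each $n$, define the following set:

\[ \mathcal{D}_n^v \triangleq \{\hat{s}(n) \mid \mathcal{L}(A_{s(n)}, \hat{s}(n)) - \mathcal{L}(A_{s(n)}, s(n)) < \epsilon_v\}. \]

Thus, for $n \ge N(v)$ we have

\[ \Pr(t(n) \in \mathcal{D}_n^v) = 1 - \Pr\big(\mathcal{L}(A_{s(n)}, t(n)) \ge \mathcal{L}(A_{s(n)}, s(n)) + \epsilon_v\big) > 1 - \delta_v. \]

For $n \ge N(1)$, define $\mathcal{D}_n \triangleq \mathcal{D}_n^{v(n)}$, where $v(n) \triangleq \max(v \mid N(v) \le n;\ v \le n)$. Thus for
all $n \ge N(1)$ we have that $\Pr(t(n) \in \mathcal{D}_n) > 1 - \delta_{v(n)}$ and
$\sup_{\hat{s}(n) \in \mathcal{D}_n} \mathcal{L}(A_{s(n)}, \hat{s}(n)) - \mathcal{L}(A_{s(n)}, s(n)) < \epsilon_{v(n)}$. Since $v(n) \uparrow \infty$, it follows that
$\delta_{v(n)} \to 0$ and $\epsilon_{v(n)} \to 0$. Therefore, $\Pr(t(n) \in \mathcal{D}_n) \to 1$, and

\[ \limsup_n \Big\{ \sup_{\hat{s}(n) \in \mathcal{D}_n} \mathcal{L}(A_{s(n)}, \hat{s}(n)) - \mathcal{L}(A_{s(n)}, s(n)) \Big\} \le 0. \]

That is, $A$ is weakly robust w.r.t. $s$, which is the desired contradiction. ■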

We now prove the necessity of weak robustness. Recall that $l(\cdot, \cdot)$ is uniformly bounded.
Thus by Hoeffding's inequality we have that for any $\epsilon, \delta$, there exists $n^*$ such that for any
$n > n^*$, with probability at least $1 - \delta$, we have
$\big| \frac{1}{n} \sum_{i=1}^n l(A_{s(n)}, t_i) - \mathbb{E}_t\big(l(A_{s(n)}, t)\big) \big| \le \epsilon$.
This implies that

\[ \mathcal{L}(A_{s(n)}, t(n)) - \mathbb{E}_t\, l(A_{s(n)}, t) \xrightarrow{\Pr} 0. \tag{10} \]

Since algorithm $A$ is not weakly robust, Lemma 19 implies that (9) holds for infinitely many $n$.
This, combined with Equation (10), implies that for infinitely many $n$,

\[ \mathbb{E}_t\, l(A_{s(n)}, t) \ge \mathcal{L}(A_{s(n)}, s(n)) + \frac{\epsilon^*}{2}, \]

which means that $A$ does not generalize. Thus, the necessity of weak robustness is
established. ■
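
The Hoeffding step in the necessity argument can be made concrete. For a loss taking values in [0, M] and a hypothesis that does not depend on the test samples, the two-sided Hoeffding bound gives Pr(|average - mean| >= eps) <= 2 exp(-2 n eps^2 / M^2), so any n at least M^2 ln(2/delta) / (2 eps^2) suffices. The sketch below computes that threshold; the function name `hoeffding_sample_size` and the example values are illustrative assumptions.

```python
import math

def hoeffding_sample_size(eps: float, delta: float, M: float) -> int:
    """Smallest n with 2 * exp(-2 * n * eps**2 / M**2) <= delta, for losses in [0, M]."""
    if not (eps > 0 and 0 < delta < 1 and M > 0):
        raise ValueError("require eps > 0, 0 < delta < 1, M > 0")
    return math.ceil(M ** 2 * math.log(2.0 / delta) / (2.0 * eps ** 2))

def hoeffding_tail(n: int, eps: float, M: float) -> float:
    """Two-sided Hoeffding upper bound on Pr(|average loss - expected loss| >= eps)."""
    return 2.0 * math.exp(-2.0 * n * eps ** 2 / M ** 2)

n_star = hoeffding_sample_size(eps=0.1, delta=0.05, M=1.0)
print(n_star, hoeffding_tail(n_star, 0.1, 1.0))
```

Halving eps roughly quadruples the required number of test samples, while tightening delta only costs a logarithmic factor.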



Theorem 18 immediately leads to the following corollary.
Corollary 20 An algorithm $A$ generalizes w.p. 1 if and only if it is a.s. weakly robust.

7. Discussion

In this paper we investigated the generalization ability of learning algorithms based on their
robustness: the property that if a testing sample is "similar" to a training sample, then
its loss is close to the training error. This provides a novel approach, different from the
complexity or stability arguments, to studying the performance of learning algorithms. We
further showed that a weak notion of robustness characterizes generalizability, which implies
that robustness is a fundamental property for learning algorithms to work.

Before concluding the paper, we outline several directions for future research. 

• Adaptive partition: In Definition 2, when the notion of robustness was introduced, we
required that the partitioning of $\mathcal{Z}$ into $K$ sets be fixed. That is, regardless of the
training sample set, we partition $\mathcal{Z}$ into the same $K$ sets. A natural and interesting
question is what happens if no such fixed partition exists, and instead we can only
partition $\mathcal{Z}$ into $K$ sets adaptively, i.e., different training sets lead to different
partitionings of $\mathcal{Z}$. The adaptive partition setup can be used to study algorithms such as
k-NN. Our current proof technique does not straightforwardly extend to such a setup,
and we would like to understand whether a meaningful generalization bound under
this weaker notion of robustness can be obtained.

• Mismatched datasets: One advantage of the algorithmic robustness framework is the
ability to handle non-standard learning setups. For example, in Sections 3.2 and 3.3 we
derived generalization bounds for quantile loss and for samples drawn from a Markovian
chain, respectively. A problem of the same essence is that of mismatched datasets,
where the training samples are generated according to a distribution slightly different
from that of the testing samples, e.g., the two distributions may have a small K-L
divergence. We conjecture that in this case a generalization bound similar to Theorem 3
would be possible, with an extra term depending on the magnitude of the difference
between the two distributions.

• Outlier removal: One possible reason that the training samples are generated differently
from the testing samples is outlier corruption. It is often the case that the training
sample set is corrupted by some outliers. In addition, algorithms designed to be
outlier resistant abound in the literature (e.g., Huber, 1981; Rousseeuw and Leroy, 1987).
The robust framework may provide a novel approach to studying both the generalization
ability and the outlier-resistant property of these algorithms. In particular, the
results reported in Section 3.2 can serve as a starting point for future research in this
direction.

• Consistency: We addressed in this paper the relationship between robustness and
generalizability. An equally important feature of learning algorithms is consistency: the
property that a learning algorithm is guaranteed to recover the globally optimal solution as
the number of training samples increases. While it is straightforward that if an algorithm
minimizes the empirical error asymptotically and also generalizes (or equivalently is
weakly robust), then it is consistent, much less is known about necessary conditions
for an algorithm to be consistent. It is certainly interesting to investigate the
relationship between consistency and robustness, and in particular whether robustness
is necessary for consistency, at least for algorithms that asymptotically minimize the
empirical error.

• Other robust algorithms: The proposed robust approach considers a general learning
setup. However, except for PCA, the algorithms investigated in our examples all belong to
the supervised learning setting. One natural extension is to investigate other robust
unsupervised and semi-supervised learning algorithms. One difficulty is that, compared
to the supervised learning case, the analysis of unsupervised/semi-supervised learning
algorithms can be challenging, because many of them are random iterative
algorithms (e.g., k-means).

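Appendix A. Proofs
A.1 Proof of Theorem 13

We observe the following properties of quantile value and truncated mean:

1. If $X$ is supported on $\mathbb{R}^+$ and $\beta_1 \ge \beta_2$, then

\[ Q^{\beta_1}(X) \ge Q^{\beta_2}(X); \quad T^{\beta_1}(X) \ge T^{\beta_2}(X). \]

2. If $Y$ stochastically dominates $X$, i.e., $\Pr(Y \ge a) \ge \Pr(X \ge a)$ for all $a \in \mathbb{R}$, then for
any $\beta$,

\[ Q^{\beta}(Y) \ge Q^{\beta}(X); \quad T^{\beta}(Y) \ge T^{\beta}(X). \]

3. The $\beta$-truncated mean of the empirical distribution of nonnegative $(x_1, \cdots, x_n)$ is given
by

\[ \min_{\alpha:\ 0 \le \alpha_i \le \frac{1}{n},\ \sum_{i=1}^n \alpha_i = \beta}\ \sum_{i=1}^n \alpha_i x_i. \]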
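
To make the third property concrete, the sketch below computes an empirical quantile and a truncated mean for nonnegative samples. It assumes the truncated mean is the unnormalized quantity obtained by minimizing a weighted sum with weights between 0 and 1/n whose total is beta, which a greedy pass over the sorted values solves by loading the smallest values first. It also assumes the quantile is the smallest sample value with at least a beta fraction of samples at or below it. The function names are hypothetical.

```python
import math
import numpy as np

def empirical_quantile(x, beta):
    """Smallest sample value c such that at least a beta fraction of samples are <= c."""
    xs = np.sort(np.asarray(x, dtype=float))
    k = max(1, math.ceil(beta * len(xs) - 1e-12))  # number of samples needed at or below c
    return float(xs[k - 1])

def truncated_mean(x, beta):
    """min sum_i a_i * x_i  subject to  0 <= a_i <= 1/n  and  sum_i a_i = beta."""
    xs = np.sort(np.asarray(x, dtype=float))
    n = len(xs)
    remaining, total = float(beta), 0.0
    for value in xs:                      # put as much weight as allowed on small values
        weight = min(1.0 / n, remaining)
        total += weight * value
        remaining -= weight
        if remaining <= 0.0:
            break
    return total

x = [4.0, 1.0, 3.0, 2.0]
print(empirical_quantile(x, 0.5), truncated_mean(x, 0.5))
print(empirical_quantile(x, 0.6), truncated_mean(x, 0.6))
```

Increasing beta can only add weight on larger values, and shifting every sample upward can only raise both quantities, which matches the first two properties in the list.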

By definition of pseudo-robustness, $\mathcal{Z}$ can be partitioned into $K$ disjoint sets, denoted
as $\{C_i\}_{i=1}^K$, and there is a subset of training samples $\hat{s}$ with $|\hat{s}| = \hat{n}$ such that

\[ z_1 \in \hat{s},\ z_1, z_2 \in C_i \implies |l(A_s, z_1) - l(A_s, z_2)| \le \epsilon(s); \quad \forall s. \]

Let $N_i$ be the set of indices of points of $s$ that fall into $C_i$. Let $\mathcal{E}$ be the event that
the following holds:

\[ \sum_{i=1}^K \left| \frac{|N_i|}{n} - \mu(C_i) \right| \le \sqrt{\frac{2K \ln 2 + 2 \ln(1/\delta)}{n}}. \]

From the proof of Theorem 3, $\Pr(\mathcal{E}) \ge 1 - \delta$. Hereafter we restrict the discussion to the
case when $\mathcal{E}$ holds.
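
For intuition about the size of the deviation allowed by this event, the sketch below evaluates the threshold sqrt((2K ln 2 + 2 ln(1/delta)) / n), which the later steps appear to denote by lambda_0, and checks by simulation how often the summed deviation between cell frequencies and cell probabilities stays below it. The uniform cell probabilities, the sample sizes, and the function names are assumptions chosen for illustration.

```python
import math
import numpy as np

def lambda0(K: int, n: int, delta: float) -> float:
    """Threshold sqrt((2 K ln 2 + 2 ln(1/delta)) / n) on the summed frequency deviation."""
    return math.sqrt((2 * K * math.log(2) + 2 * math.log(1 / delta)) / n)

def l1_deviation(counts, probs) -> float:
    """sum_i | N_i / n - mu(C_i) | for observed cell counts N_i."""
    n = counts.sum()
    return float(np.abs(counts / n - probs).sum())

rng = np.random.default_rng(0)
K, n, delta = 5, 2000, 0.05
probs = np.full(K, 1.0 / K)          # hypothetical cell probabilities mu(C_i)
lam = lambda0(K, n, delta)

trials = 200
hits = sum(l1_deviation(rng.multinomial(n, probs), probs) <= lam for _ in range(trials))
print(lam, hits / trials)            # fraction of trials in which the event holds
```

The threshold shrinks like 1/sqrt(n) and grows only like sqrt(K), so the event becomes a tight constraint once n is large relative to the number of cells.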
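Denote

\[ v_j = \arg\min_{z \in C_j} l(A_s, z). \]

By symmetry, without loss of generality we assume that $0 \le l(A_s, v_1) \le l(A_s, v_2) \le \cdots \le
l(A_s, v_K) \le M$. Define a set of samples $\tilde{s}$ as

\[ \tilde{s}_i = \begin{cases} s_i & \text{if } s_i \in \hat{s}; \\ v_j & \text{if } s_i \notin \hat{s},\ s_i \in C_j. \end{cases} \]

Define discrete probability measures $\hat{\mu}$ and $\tilde{\mu}$, supported on $\{v_1, \cdots, v_K\}$, as

\[ \hat{\mu}(\{v_j\}) = \mu(C_j); \quad \tilde{\mu}(\{v_j\}) = \frac{|N_j|}{n}. \]

Further, let $\tilde{\mu}_{\mathrm{emp}}$ denote the empirical distribution of sample set $\tilde{s}$.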
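Proof of (I):
Observe that $\mu$ stochastically dominates $\hat{\mu}$, hence

\[ Q(A_s, \beta, \hat{\mu}) \le Q(A_s, \beta, \mu). \tag{11} \]

Also by definition of $Q(\cdot)$ and $\hat{\mu}$,

\[ Q(A_s, \beta, \hat{\mu}) = l(A_s, v_{k^*}); \quad \text{where: } k^* = \min\Big\{k : \sum_{i=1}^k \hat{\mu}(v_i) \ge \beta\Big\}. \]

Let $\bar{s}$ be the set of all samples $s_i$ such that $s_i \in \hat{s}$, and $s_i \in C_j$ for some $j \le k^*$. Observe
that

\[ \forall s_i \in \bar{s} : \quad l(A_s, s_i) \le l(A_s, v_{k^*}) + \epsilon(s) = Q(A_s, \beta, \hat{\mu}) + \epsilon(s). \tag{12} \]

Note that $\mathcal{E}$ implies

\[ \frac{1}{n} \sum_{j=1}^{k^*} \sum_{s_i \in C_j} 1 \ \ge\ \sum_{j=1}^{k^*} \mu(C_j) - \lambda_0 = \sum_{j=1}^{k^*} \hat{\mu}(v_j) - \lambda_0 \ \ge\ \beta - \lambda_0. \]

Since $A_s$ is pseudo robust, we have

\[ \frac{1}{n} \sum_{s_i \notin \hat{s}} 1 = \frac{n - \hat{n}}{n}. \]

Therefore

\[ \frac{1}{n} \sum_{j=1}^{k^*} \sum_{s_i \in \hat{s},\, s_i \in C_j} 1 \ \ge\ \frac{1}{n} \sum_{j=1}^{k^*} \sum_{s_i \in C_j} 1 - \frac{1}{n} \sum_{s_i \notin \hat{s}} 1 \ \ge\ \beta - \lambda_0 - \frac{n - \hat{n}}{n}. \]

Thus, $\bar{s}$ is a subset of $s$ of at least $n(\beta - \lambda_0 - (n - \hat{n})/n)$ elements. Thus (11) and (12) lead
to

\[ Q(A_s, \beta - \lambda_0 - (n - \hat{n})/n, \mu_{\mathrm{emp}}) \le \max\{l(A_s, s_i) : s_i \in \bar{s}\} \le Q(A_s, \beta, \mu) + \epsilon(s). \]

Thus, we establish the left inequality. The proof of the right one is identical and is hence
omitted.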

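Proof of (II):

The proof consists of four steps.

Step 1: Observe that $\mu$ stochastically dominates $\hat{\mu}$, hence

\[ T(A_s, \beta, \hat{\mu}) \le T(A_s, \beta, \mu). \]

Step 2: We prove that

\[ T(A_s, \beta - \lambda_0, \tilde{\mu}) \le T(A_s, \beta, \hat{\mu}). \]

Note that $\mathcal{E}$ implies that for all $j$, we have

\[ \hat{\mu}(\{v_1, \cdots, v_j\}) - \lambda_0 \le \tilde{\mu}(\{v_1, \cdots, v_j\}). \]

There uniquely exist a non-negative integer $j^*$ and a $c^* \in [0, 1)$ such that

\[ \hat{\mu}(\{v_1, \cdots, v_{j^*}\}) + c^* \hat{\mu}(\{v_{j^*+1}\}) = \beta. \]

Define

\[ \tilde{\beta} = \sum_{i=1}^{j^*} \min\big(\tilde{\mu}(\{v_i\}), \hat{\mu}(\{v_i\})\big) + c^* \min\big(\tilde{\mu}(\{v_{j^*+1}\}), \hat{\mu}(\{v_{j^*+1}\})\big); \tag{13} \]

then we have $\tilde{\beta} \ge \beta - \lambda_0$, which leads to

\[
\begin{aligned}
T(A_s, \beta - \lambda_0, \tilde{\mu}) &\le T(A_s, \tilde{\beta}, \tilde{\mu}) \\
&\overset{(a)}{\le} \sum_{i=1}^{j^*} l(A_s, v_i) \min\big(\tilde{\mu}(\{v_i\}), \hat{\mu}(\{v_i\})\big) + c^* l(A_s, v_{j^*+1}) \min\big(\tilde{\mu}(\{v_{j^*+1}\}), \hat{\mu}(\{v_{j^*+1}\})\big) \\
&\le \sum_{i=1}^{j^*} l(A_s, v_i) \hat{\mu}(\{v_i\}) + c^* l(A_s, v_{j^*+1}) \hat{\mu}(\{v_{j^*+1}\}) = T(A_s, \beta, \hat{\mu}),
\end{aligned}
\]

where (a) holds because Equation (13) essentially means that $T(A_s, \tilde{\beta}, \tilde{\mu})$ is a weighted
sum with total weight equal to $\tilde{\beta}$ that puts more weight on small terms, and hence is
smaller.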

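Step 3: We prove that

\[ T(A_s, \beta - \lambda_0, \tilde{\mu}_{\mathrm{emp}}) - \epsilon(s) \le T(A_s, \beta - \lambda_0, \tilde{\mu}). \]

Let $\tilde{t}$ be a set of $n$ samples, such that $|N_j|$ of them are $v_j$ for $j = 1, \cdots, K$. Observe that $\tilde{\mu}$
is the empirical distribution of $\tilde{t}$. Further note that there is a one-to-one mapping between
samples in $\tilde{s}$ and those in $\tilde{t}$ such that each pair (say $\tilde{s}_i, \tilde{t}_i$) of samples belongs to the same
$C_j$. By definition of $\tilde{s}$ this guarantees that $|l(A_s, \tilde{s}_i) - l(A_s, \tilde{t}_i)| \le \epsilon(s)$, which implies

\[ T(A_s, \beta - \lambda_0, \tilde{\mu}_{\mathrm{emp}}) - \epsilon(s) \le T(A_s, \beta - \lambda_0, \tilde{\mu}). \]

Step 4: We prove that

\[ T\Big(A_s, \beta - \lambda_0 - \frac{n - \hat{n}}{n}, \mu_{\mathrm{emp}}\Big) \le T(A_s, \beta - \lambda_0, \tilde{\mu}_{\mathrm{emp}}). \]

Let $I = \{i : s_i = \tilde{s}_i\}$; the following holds:

\[ \sum_{i=1}^n \alpha_i l(A_s, \tilde{s}_i) \ge \sum_{i \in I} \alpha_i l(A_s, \tilde{s}_i) = \sum_{i \in I} \alpha_i l(A_s, s_i); \quad \forall \alpha : 0 \le \alpha_i \le \frac{1}{n};\ \sum_{i=1}^n \alpha_i = \beta - \lambda_0. \]

Note that $|\{i \notin I\}| = n - \hat{n}$, so $\sum_{i \in I} \alpha_i \ge \beta - \lambda_0 - \frac{n - \hat{n}}{n}$. Thus we have, for all
$\alpha$ with $0 \le \alpha_i \le \frac{1}{n}$ and $\sum_{i=1}^n \alpha_i = \beta - \lambda_0$,

\[ \sum_{i \in I} \alpha_i l(A_s, s_i) \ \ge \min_{\alpha':\ 0 \le \alpha'_i \le \frac{1}{n},\ \sum_{i=1}^n \alpha'_i = \beta - \lambda_0 - \frac{n - \hat{n}}{n}}\ \sum_{i=1}^n \alpha'_i l(A_s, s_i) = T\Big(A_s, \beta - \lambda_0 - \frac{n - \hat{n}}{n}, \mu_{\mathrm{emp}}\Big). \]

Therefore,

\[ \sum_{i=1}^n \alpha_i l(A_s, \tilde{s}_i) \ge T\Big(A_s, \beta - \lambda_0 - \frac{n - \hat{n}}{n}, \mu_{\mathrm{emp}}\Big); \quad \forall \alpha : 0 \le \alpha_i \le \frac{1}{n};\ \sum_{i=1}^n \alpha_i = \beta - \lambda_0. \]

Minimizing over $\alpha$ on the left-hand side, we obtain

\[ T\Big(A_s, \beta - \lambda_0 - \frac{n - \hat{n}}{n}, \mu_{\mathrm{emp}}\Big) \le T(A_s, \beta - \lambda_0, \tilde{\mu}_{\mathrm{emp}}). \]

Combining all four steps, we have proved the left inequality, i.e.,

\[ T\Big(A_s, \beta - \lambda_0 - \frac{n - \hat{n}}{n}, \mu_{\mathrm{emp}}\Big) - \epsilon(s) \le T(A_s, \beta, \mu). \]

The right inequality can be proved identically and is hence omitted.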

A.2 Proof of Example 3

We can partition $\mathcal{Z}$ as $\{-1\} \times C_1, \cdots, \{-1\} \times C_K, \{+1\} \times C_1, \cdots, \{+1\} \times C_K$. Consider $z_a, z_b$
that belong to the same set; then $z_{a|y} = z_{b|y}$, and there exists $i$ such that $z_{a|x}, z_{b|x} \in C_i$, which by the
definition of the Majority Voting algorithm implies that $A_s(z_{a|x}) = A_s(z_{b|x})$. Thus, we have

\[ l(A_s, z_a) = f(z_{a|y}, A_s(z_{a|x})) = f(z_{b|y}, A_s(z_{b|x})) = l(A_s, z_b). \]

Hence MV is $(2K, 0)$-robust.
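
The argument can be checked on a toy instance. The sketch below assumes a one-dimensional input space [0, 1) split into K equal-width cells, labels in {-1, +1}, and the 0-1 loss as the function f; all of these, together with the function names, are illustrative assumptions. Two points with the same label in the same cell then receive exactly the same loss.

```python
import numpy as np

def fit_majority_vote(x, y, K):
    """Learn one label in {-1, +1} per cell; cells are K equal-width bins of [0, 1)."""
    cells = np.minimum((x * K).astype(int), K - 1)
    votes = np.zeros(K)
    np.add.at(votes, cells, y)            # sum of labels of training points in each cell
    return np.where(votes >= 0, 1, -1)    # ties and empty cells default to +1

def predict(labels, x):
    """Label assigned to the cell containing x."""
    K = len(labels)
    return int(labels[min(int(x * K), K - 1)])

def zero_one_loss(labels, z):
    """0-1 loss of the learned rule on z = (x, y)."""
    x, y = z
    return float(predict(labels, x) != y)

rng = np.random.default_rng(0)
x_train = rng.uniform(size=200)
y_train = np.where(x_train > 0.5, 1, -1)
labels = fit_majority_vote(x_train, y_train, K=10)

z_a, z_b = (0.21, 1), (0.29, 1)           # same cell (index 2) and same label
print(zero_one_loss(labels, z_a) == zero_one_loss(labels, z_b))
```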

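A.3 Proof of Example 5

The existence of $f_{\mathcal{H}}(\gamma)$ follows from the compactness of $\mathcal{X}$ and continuity of $k(\cdot, \cdot)$.

To prove the robustness of SVM, let $(w^*, d^*)$ be the solution given training data $s$. To
avoid notation clutter, let $y_i = s_{i|y}$ and $x_i = s_{i|x}$. Thus, we have (due to optimality of
$w^*, d^*$)

\[ c \|w^*\|_{\mathcal{H}}^2 + \frac{1}{n} \sum_{i=1}^n \big[1 - y_i(\langle w^*, \phi(x_i) \rangle + d^*)\big]^+ \le c \|0\|_{\mathcal{H}}^2 + \frac{1}{n} \sum_{i=1}^n \big[1 - y_i(\langle 0, \phi(x_i) \rangle + 0)\big]^+ = 1, \]

which implies $\|w^*\|_{\mathcal{H}} \le \sqrt{1/c}$. Let $c_1, \cdots, c_{\mathcal{N}(\gamma/2, \mathcal{X}, \|\cdot\|_2)}$ be a $\gamma/2$-cover of $\mathcal{X}$ (recall that $\mathcal{X}$
is compact); then we can partition $\mathcal{Z}$ into $2\mathcal{N}(\gamma/2, \mathcal{X}, \|\cdot\|_2)$ sets, such that if $(y_1, x_1)$ and
$(y_2, x_2)$ belong to the same set, then $y_1 = y_2$ and, by the triangle inequality, $\|x_1 - x_2\|_2 \le \gamma$.
Further observe that if $y_1 = y_2$ and $\|x_1 - x_2\|_2 \le \gamma$, then

\[
\begin{aligned}
& |l((w^*, d^*), z_1) - l((w^*, d^*), z_2)| \\
=\ & \big| [1 - y_1(\langle w^*, \phi(x_1) \rangle + d^*)]^+ - [1 - y_2(\langle w^*, \phi(x_2) \rangle + d^*)]^+ \big| \\
\le\ & |\langle w^*, \phi(x_1) - \phi(x_2) \rangle| \\
\le\ & \|w^*\|_{\mathcal{H}} \sqrt{\langle \phi(x_1) - \phi(x_2), \phi(x_1) - \phi(x_2) \rangle} \\
\le\ & \sqrt{f_{\mathcal{H}}(\gamma)/c}.
\end{aligned}
\]

Here the last inequality follows from the definition of $f_{\mathcal{H}}$. Hence, the example holds by
Theorem 14.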

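A.4 Proof of Example 6

It suffices to show the following lemma, which establishes that the loss of the Lasso solution is
Lipschitz continuous.

Lemma 21 If $w^*(s)$ is the solution of Lasso given training set $s$, then

\[ |l(w^*(s), z_a) - l(w^*(s), z_b)| \le \Big[ \frac{1}{nc} \sum_{i=1}^n s_{i|y}^2 + 1 \Big] \|z_a - z_b\|_\infty. \]

Proof For succinctness we let $y_i = s_{i|y}$, $x_i = s_{i|x}$ for $i = 1, \cdots, n$. Similarly, we let
$y_{a(b)} = z_{a(b)|y}$ and $x_{a(b)} = z_{a(b)|x}$. Since $w^*(s)$ is the solution of Lasso, we have (due to
optimality)

\[ \frac{1}{n} \sum_{i=1}^n (y_i - x_i^\top w^*(s))^2 + c \|w^*(s)\|_1 \le \frac{1}{n} \sum_{i=1}^n (y_i - x_i^\top 0)^2 + c \|0\|_1 = \frac{1}{n} \sum_{i=1}^n y_i^2, \]

which implies $\|w^*(s)\|_1 \le \frac{1}{nc} \sum_{i=1}^n y_i^2$. Therefore,

\[
\begin{aligned}
|l(w^*(s), z_a) - l(w^*(s), z_b)| &= \big| |y_a - w^*(s)^\top x_a| - |y_b - w^*(s)^\top x_b| \big| \\
&\le |(y_a - w^*(s)^\top x_a) - (y_b - w^*(s)^\top x_b)| \\
&\le |y_a - y_b| + \|w^*(s)\|_1 \|x_a - x_b\|_\infty \\
&\le (\|w^*(s)\|_1 + 1) \|z_a - z_b\|_\infty \\
&\le \Big[ \frac{1}{nc} \sum_{i=1}^n y_i^2 + 1 \Big] \|z_a - z_b\|_\infty.
\end{aligned}
\]

Here the first two inequalities hold by the triangle inequality, the third holds because
$z = (x, y)$, and the last follows from the bound on $\|w^*(s)\|_1$. ■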
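
The Lipschitz step of this argument does not depend on the weight vector being a Lasso solution: for any w, the absolute loss |y - w.x| changes by at most (||w||_1 + 1) times the sup-norm distance between two samples. The sketch below checks this numerically for a random w; the random data, the packing of z as (x, y) in one array, and the function names are assumptions for illustration, and no Lasso solver is involved.

```python
import numpy as np

def abs_loss(w, z):
    """Loss |y - w.x| for a sample z = (x_1, ..., x_m, y) stored as one array."""
    x, y = z[:-1], z[-1]
    return abs(y - w @ x)

def worst_ratio(w, trials=1000, seed=0):
    """Largest observed |l(w, z_a) - l(w, z_b)| / ||z_a - z_b||_inf over random pairs."""
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(trials):
        za = rng.normal(size=len(w) + 1)
        zb = rng.normal(size=len(w) + 1)
        gap = abs(abs_loss(w, za) - abs_loss(w, zb))
        worst = max(worst, gap / np.abs(za - zb).max())
    return worst

w = np.random.default_rng(42).normal(size=5)
lipschitz_constant = np.abs(w).sum() + 1.0     # ||w||_1 + 1
print(worst_ratio(w) <= lipschitz_constant)
```

Plugging in the bound on the l1 norm of the Lasso solution then gives the constant stated in the lemma.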
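A.5 Proof of Example 7

To see why the example holds, it suffices to show the following lemma, which establishes
that the neural network mentioned is Lipschitz continuous. For simplicity, we write the
prediction given $x \in \mathcal{X}$ as $NN(x)$.

Lemma 22 Fix $\alpha, \beta$. If a $d$-layer neural network satisfies $|\sigma(a) - \sigma(b)| \le \beta |a - b|$
and $\sum_{j=1}^{N_v} |w_{ij}^v| \le \alpha$ for all $v, i$, then the following holds:

\[ |l(A_s, z) - l(A_s, \hat{z})| \le (1 + \alpha^d \beta^d) \|z - \hat{z}\|_\infty. \]

Proof Let $x_i^v$ and $\hat{x}_i^v$ be the output of the $i$-th unit of the $v$-th layer for samples $z$ and
$\hat{z}$ respectively. Let $x^v$ and $\hat{x}^v$ be the vectors whose $i$-th elements are $x_i^v$ and $\hat{x}_i^v$
respectively. From $\sum_{j=1}^{N_{v-1}} |w_{ij}^{v-1}| \le \alpha$ we have

\[
\begin{aligned}
|x_i^v - \hat{x}_i^v| &= \Big| \sigma\Big( \sum_{j=1}^{N_{v-1}} w_{ij}^{v-1} x_j^{v-1} \Big) - \sigma\Big( \sum_{j=1}^{N_{v-1}} w_{ij}^{v-1} \hat{x}_j^{v-1} \Big) \Big| \\
&\le \beta \Big| \sum_{j=1}^{N_{v-1}} w_{ij}^{v-1} x_j^{v-1} - \sum_{j=1}^{N_{v-1}} w_{ij}^{v-1} \hat{x}_j^{v-1} \Big| \\
&\le \beta \alpha \|x^{v-1} - \hat{x}^{v-1}\|_\infty.
\end{aligned}
\]

Here, the first inequality holds from the Lipschitz condition of $\sigma$, and the second inequality
holds from $\sum_{j=1}^{N_{v-1}} |w_{ij}^{v-1}| \le \alpha$. Iterating over $d$ layers, we have

\[ |NN(z_{|x}) - NN(\hat{z}_{|x})| = |x^d - \hat{x}^d| \le \alpha^d \beta^d \|x - \hat{x}\|_\infty, \]

which implies

\[
\begin{aligned}
|l(A_s, z) - l(A_s, \hat{z})| &= \big| |z_{|y} - NN(z_{|x})| - |\hat{z}_{|y} - NN(\hat{z}_{|x})| \big| \\
&\le |z_{|y} - \hat{z}_{|y}| + |NN(z_{|x}) - NN(\hat{z}_{|x})| \\
&\le (1 + \alpha^d \beta^d) \|z - \hat{z}\|_\infty.
\end{aligned}
\]

This proves the lemma.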
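
The layer-by-layer bound can be verified numerically. The sketch below builds a small network whose weight rows each have l1 norm equal to alpha, uses tanh as the activation (which is 1-Lipschitz, so beta = 1), and checks that the output gap never exceeds (alpha * beta)^d times the sup-norm gap of the inputs. The layer sizes, the choice of tanh, and the function names are assumptions for illustration.

```python
import numpy as np

def make_weights(sizes, alpha, rng):
    """Random weight matrices whose rows each have l1 norm exactly alpha."""
    weights = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        W = rng.normal(size=(n_out, n_in))
        W *= alpha / np.abs(W).sum(axis=1, keepdims=True)
        weights.append(W)
    return weights

def forward(weights, x):
    """Feed-forward pass x -> tanh(W x), repeated for every layer."""
    for W in weights:
        x = np.tanh(W @ x)
    return x

rng = np.random.default_rng(0)
alpha, beta = 1.5, 1.0                 # beta = 1 because tanh is 1-Lipschitz
sizes = [4, 6, 6, 1]                   # three weight layers, scalar output
weights = make_weights(sizes, alpha, rng)
d = len(weights)
bound = (alpha * beta) ** d

worst = 0.0
for _ in range(500):
    x, x_hat = rng.normal(size=4), rng.normal(size=4)
    gap = abs(forward(weights, x)[0] - forward(weights, x_hat)[0])
    worst = max(worst, gap / np.abs(x - x_hat).max())
print(worst <= bound)
```

The bound is typically loose in practice because it allows every layer to amplify differences by the full factor alpha * beta at once.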

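A.6 Proof of Example 8

We show that the loss of PCA is Lipschitz continuous, and then apply Theorem 14.
Let $(w_1^*(s), \cdots, w_d^*(s))$ be the solution of PCA trained on $s$. Thus we have

\[
\begin{aligned}
& |l((w_1^*(s), \cdots, w_d^*(s)), z_a) - l((w_1^*(s), \cdots, w_d^*(s)), z_b)| \\
=\ & \Big| \sum_{k=1}^d (w_k^*(s)^\top z_a)^2 - \sum_{k=1}^d (w_k^*(s)^\top z_b)^2 \Big| \\
\le\ & \sum_{k=1}^d \big| [w_k^*(s)^\top z_a - w_k^*(s)^\top z_b][w_k^*(s)^\top z_a + w_k^*(s)^\top z_b] \big| \\
\le\ & 2dB \|z_a - z_b\|_2,
\end{aligned}
\]

where the last inequality holds because $\|w_k^*(s)\|_2 = 1$ and $\|z_a\|_2, \|z_b\|_2 \le B$. Hence, the
example holds by Theorem 14.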
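
A quick numerical check of this Lipschitz bound is sketched below. It computes the top d directions via a singular value decomposition of the uncentered sample matrix, which solves the maximization problem with unit-norm, mutually orthogonal directions, and then verifies that the loss gap between two points of norm at most B never exceeds 2dB times their Euclidean distance. The dimensions, the sampling scheme, and the function names are assumptions for illustration.

```python
import numpy as np

def pca_components(S, d):
    """Top-d right singular vectors of S (rows are orthonormal directions w_k)."""
    _, _, Vt = np.linalg.svd(S, full_matrices=False)
    return Vt[:d]

def pca_loss(W, z):
    """Loss sum_k (w_k . z)^2 for directions stored as rows of W."""
    return float(np.sum((W @ z) ** 2))

def sample_in_ball(rng, m, B):
    """Random point of R^m with Euclidean norm at most B."""
    z = rng.normal(size=m)
    return z * (B * rng.uniform() / np.linalg.norm(z))

rng = np.random.default_rng(0)
m, d, B = 6, 2, 3.0
S = np.stack([sample_in_ball(rng, m, B) for _ in range(50)])
W = pca_components(S, d)

worst = 0.0
for _ in range(1000):
    za, zb = sample_in_ball(rng, m, B), sample_in_ball(rng, m, B)
    gap = abs(pca_loss(W, za) - pca_loss(W, zb))
    worst = max(worst, gap / np.linalg.norm(za - zb))
print(worst <= 2 * d * B)
```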

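A.7 Proof of Example 9

Set $\hat{s}$ as

\[ \hat{s} \triangleq \{s_i \in s \mid \mathcal{D}(s_{i|x}, A_s) > \gamma\}. \]

And let $c_1, \cdots, c_{\mathcal{N}(\gamma/2, \mathcal{X}, \rho)}$ be a $\gamma/2$-cover of $\mathcal{X}$. Thus, we can partition $\mathcal{Z}$ into $2\mathcal{N}(\gamma/2, \mathcal{X}, \rho)$
subsets $\{C_i\}$, such that

\[ z_1, z_2 \in C_i \implies y_1 = y_2;\ \ \&\ \ \rho(x_1, x_2) \le \gamma. \]

This implies that:

\[ z_1 \in \hat{s},\ z_1, z_2 \in C_i \implies y_1 = y_2;\ A_s(x_1) = A_s(x_2) \implies l(A_s, z_1) = l(A_s, z_2). \]

By definition, $A$ is $(2\mathcal{N}(\gamma/2, \mathcal{X}, \rho), 0, \hat{n})$ pseudo robust.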